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Abstract 

The modelling of data on a spherical surface requires the consideration of directional prob¬ 
ability distributions. To model asymmetrically distributed data on a three-dimensional 
sphere, Kent distributions are often used. The moment estimates of the parameters are 
typically used in modelling tasks involving Kent distributions. However, these lack a rigor¬ 
ous statistical treatment. The focus of the paper is to introduce a Bayesian estimation of 
the parameters of the Kent distribution which has not been carried out in the literature, 
partly because of its complex mathematical form. We employ the Bayesian information- 
theoretic paradigm of Minimum Message Length (MML) to bridge this gap and derive 
reliable estimators. The inferred parameters are subsequently used in mixture modelling of 
Kent distributions. The problem of inferring the suitable number of mixture components 
is also addressed using the MML criterion. We demonstrate the superior performance of 
the derived MML-based parameter estimates against the traditional estimators. We apply 
the MML principle to infer mixtures of Kent distributions to model empirical data corre¬ 
sponding to protein conformations. We demonstrate the effectiveness of Kent models to 
act as improved descriptors of protein structural data as compared to commonly used von 
Mises-Fisher distributions. 

Keywords: Minimum Message Length, von Mises-Fisher, Kent distribution, Protein 

modelling 

1. Introduction 

Directional statistics is a growing discipline with widespread applications in earth sciences, 
meteorology, physics, biology, and other areas. A sample of directional data corresponds 
to a collection of unit vectors. The modelling of directional data has been explored using 
several types of distributions described on surfaces of compact manifolds, such as spheres 
and tori (Fisher, 1953, 1993; Mardia and Jupp, 2000). The most popular amongst these 
distributions is the von Mises-Fisher (vMF) distribution (Watson and Williams, 1956). Its 
probability density function / at any point x on a unit three-dimesional sphere has the 
form: 

/(x;0) oc expj> 7 }x} 

where oc denotes proportionality, © is the parameter vector comprising of the unit mean 
vector 7 i and the concentration parameter k > 0. The vMF distribution is analogous to 
a symmetric Gaussian distribution, wrapped around a unit sphere. As such, it is useful 
for modelling directional data that is symmetrically distributed with respect to a mean 
direction. The modelling of asymmetrically distributed directional data, however, requires 
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distributions which generalize the vMF distribution. A generalization of vMF is called the 
Fisher-Bingham distribution (Mardia, 1975) which takes the form: 

/(x;0) ocexp-t^x + /3 2 (72 x ) 2 + /5 3 (73 x ) 2 } (1) 

where the parameters 71 , 72,73 are unit vectors with 72 and 73 being orthogonal to each 
other, the parameters /?2 and (3^ are real values with f3 2 > P 3 . As the distribution is 
characterized using an 8 real valued parameter vector © (2 for 71 , 3 for 72 and 73 , and 3 
scalars n,j 32 ,f 3 ^), it is also referred to as the FBg distribution. Notice that compared to the 
vMF, the FBg distribution has an exponential factor with additional quadratic terms. 

Given more free parameters, the FBg distribution is a better choice compared to the vMF 
distribution in modelling real world three-dimensional directional data where symmetry 
cannot be assumed. However, the use of the FBg distribution in directional statistics poses 
difficulties owing to its complex mathematical form and also because of a lack of a natural 
understanding of its parameters (Kent, 1982). In order to achieve a balance between the 
highly simplified vMF model and the complex FBg distribution, Kent (1982) suggested an 
alternative form that is relatively easy to work with and whose parameters have natural 
interpretations. This distribution, referred to as the Kent distribution, is obtained from 
Equation 1 by assuming 71 , 72 , 7.3 form an orthogonal system of vectors and are subject to 
the constraint = — /?3 = (3. The probability density function is then given by 

/ ( x ; 0) = c(k, j3)~ 1 exp{fC 7 ]"x + /3[( 7 Jx) 2 - ( 7 jx) 2 ]} (2) 

where 71 , 72 , 7.3 are orthogonal unit vectors representing the mean, major, and minor axes 
respectively; k, as before, measures the concentration, and 0 < (3 < k/2 describes the 
ovalness. 

The Kent distribution was proposed as a spherical analogue of the general Gaussian 
distribution and serves as a natural extension to the vMF distribution. The distribution 
has ellipse-like contours of constant probability density on the spherical surface. Kent 
(1982) argued that by imposing the constraints (3 < k/2 and 71 , 72,73 to be an orthogonal 
system, the distribution would be unimodal and have a behaviour similar to the Gaussian 
distribution but on a spherical surface. 

As the Kent distribution is characterized using a 5 real valued parameter vector © (3 
for 71 , 72,73 because they are orthogonal and unit vectors, 2 for the scalar entities k,/3), 
it is popularly referred to as the FB 5 distribution. We will denote the 5-paranreter Fisher- 
Bingham distribution as FB. 5 (Q, n, (3), where Q = ( 71 , 72 , 73 ) is a 3 x 3 orthogonal matrix. 
The normalization constant c(n, (3 ) of the distribution is derived as an infinite series 

c(k, (3) = 2 tt fy/3 2i 0) 2 V§ («) (3) 

that depends on the Gamma function T and the modified Bessel function I v of the first kind 
and order v (Abramowitz and Stegun, 1965; Kent, 1982). 

The importance of vMF and FB 5 distributions in mixture modelling tasks has been well 
established: vMF mixtures have been used in large-scale text clustering (Banerjee et al., 
2003; Gopal and Yang, 2014), clustering of protein dihedral angles (Dowe et ah, 1996a; 
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Mardia et al., 2007), and gene expression analyses (Banerjee et al., 2005). Mixtures of 
FB 5 distributions have been employed by Peel et al. (2001) to identify joint sets in rock 
masses, and by Hamelryck et al. (2006) to sample random protein conformations. The 
FB 5 distribution has increasingly found support in machine learning tasks in structural 
bioinformatics (Kent and Hamelryck, 2005; Boomsma et ah, 2006; Hamelryck, 2009). 

The analysis of data using FB 5 distributions requires estimating the corresponding pa¬ 
rameters. Due to the complex mathematical form of the density function, these estimates 
are approximated. Kent (1982) derived the moment estimates and suggested limiting case 
approximations. However, the use of simplified approximations can have considerable ef¬ 
fects from a practical standpoint. To overcome this, we explore Bayesian estimation using 
the minimum message length (MML) principle as it results in reliable estimators as shown 
by the experiments in Section 9. 

The parameter inference of a statistical distribution is typically done by maximum like¬ 
lihood (ML) or Bayesian maximum a posteriori probability (MAP) estimation. Bayesian 
inference using MML differs from the traditional approaches as follows: (1) unlike ML, 
MML uses a prior over the parameters and considers their precision while encoding; (2) 
unlike MAP, MML estimators are invariant under non-linear transformations of the pa¬ 
rameters (Oliver and Baxter, 1994). The estimation of parameters using ML ignores the 
cost of stating the parameters, and MAP based estimation uses the probability density of 
parameters instead of their probability measure. In contrast, the MML inference process 
takes into account the optimal precision to which parameters should be stated and uses 
it to determine a corresponding probability value. The MML framework decomposes the 
inference problem into two parts: lossless encoding of the parameters, and encoding the 
data given those parameters. It then selects the parameters that result in the least overall 
message length to explain the data. Thus, models with varying parameters are evaluated 
based on their resultant total message lengths. 

The MML principle has been used in the inference of several probability distributions 
(Wallace, 2005). In particular, the MML parameter estimates of a three-dimensional vMF 
distribution were derived by Dowe et al. (1996b), wherein they demonstrated that the MML 
estimates outperform the traditional ML and MAP based ones. For modelling higher dimen¬ 
sional directional data, Kasarapu and Allison (2015) demonstrated the reliable performance 
of MML-based vMF estimates compared to other traditional estimates. In this work, we 
derive the MML-based parameter estimates of a FB 5 distribution and subsequently use 
them in the mixture modelling. The MML estimates are shown to perform better than the 
traditionally used moment and maximum likelihood estimates. Also, the invariance prop¬ 
erty of MML estimates makes them reliable candidates when compared to MAP estimates. 
We study the results of modelling the protein data using mixtures of vMF and FB 5 distri¬ 
butions. Furthermore, we demonstrate that FB 5 mixture models serve as better candidate 
models when compared to vMF mixtures in modelling protein directional data. 

The paper is organized as follows: Section 2 describes the MML framework and high¬ 
lights the key differences between the MML estimation procedure and others. Section 3 ex¬ 
plains the FB 5 distribution and the associated geometrical construction. Section 4 describes 
the existing moment and maximum likelihood parameter estimates of the FB 5 distribution. 
Section 5 describes the MAP estimation procedure in the context of the FB 5 distribution 
and emphasizes its dependency on the manner the distribution is parameterized. Section 6 
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describes the MML-based estimation of the parameters of the FB 5 distribution. Section 7 
outlines the numerical implementation of methods to compute the normalization constant 
and the corresponding partial derivatives which are required as part of MML-based estima¬ 
tion. Section 8 describes mixture modelling using FB 5 distributions with emphasis on the 
search for the optimal number of mixture components. Section 9 presents the experimental 
results of the various parameter estimation methods. Section 10 discusses the application 
of FB 5 mixtures with respect to modelling protein structural data. Section 11 concludes 
with a summary of the work. 

2. Minimum Message Length (MML) Inference 

In this section, we describe the model selection paradigm using the Minimum Message 
Length criterion and proceed to give an overview of MML-based parameter estimation for 
any distribution. 

2.1 Model selection using minimum message length criterion 

Wallace and Boulton (1968) developed the first practical criterion for model selection based 
on information theory. As per Bayes’s theorem: 

Pr (E&V) = Pr (E) x Px(V\E) = Pr(P) x Pi(E\V) 

where V denotes observed data, and E some hypothesis about that data. Further, Pr("H & 
V) is the joint probability of data V and hypothesis E, Px(E) and Pr (V) are the prior 
probabilities of hypothesis E and data V respectively, Px(E\D) is the posterior probability, 
and Pr(T>| E) is the likelihood. 

As per Shannon (1948), given an event E with probability Pr(Fl), the length of the 
optimal lossless code to represent that event requires 1(E) = — log 2 (Pr(£')) bits. Apply¬ 
ing Shannon’s insight to Bayes’s theorem, Wallace and Boulton (1968) got the following 
relationship between conditional probabilities in terms of optimal message lengths: 

I(E k V) = 1(E) + I(V\E) = I(V) + I(E\V) 

The above equation can be intrepreted as the total cost to encode a message comprising of 
the following two parts: 

1. First part: the hypothesis E, which takes 1(E) bits, 

2 . Second part: the observed data V using knowledge of E, which takes I(V\E) bits. 

As a result, given two competing hypotheses E and E', 

A / = I(E kV)- I(E' k V) = I(E\V) - I(E'\V) bits. 

Hence, Px(E'\V) = 2 AI Px(E\V) 

gives the log-odds posterior ratio between the two hypotheses. The framework provides 
a rigorous means to objectively compare two competing hypotheses. Clearly, the message 
length can vary depending on the complexity of E and how well it can explain V. A more 
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complex 77 may explain T> better but takes more bits to be stated itself. The trade-off 
comes from the fact that (hypothetically) transmitting the message requires the encoding 
of both the hypothesis and the data given the hypothesis, that is, the model complexity 
7(77) and the goodness of fit I{V\W). 


2.2 MML-based parameter estimation 

Wallace and Freeman (1987) introduced a generalized framework to estimate a set of pa¬ 
rameters 0 given data V. The method requires a reasonable prior 7(0) on the hypothesis 
and evaluating the determinant of the Fisher information matrix jj 7 )©)! of the expected 
second-order partial derivatives of the negative log-likelihood function, C(D |0). The pa¬ 
rameter vector 0 that minimizes the message length expression (given by Equation 4) is 
the MML estimate according to Wallace and Freeman (1987). 


/(©,£>) 


d 

2 


log ®- log (j^) 
1(6) ' 


+ £(V |0) + 

'-V- 

I(©l©) 


d 

2 


(4) 


where d is the number of free parameters in the model, and q f j is the 7-dimensional lat¬ 
tice quantization constant (Conway and Sloane, 1984). The total message length 7(0, P), 
therefore, comprises of two parts: (1) the cost of encoding the parameters, 7(0), and (2) the 
cost of encoding the data given the parameters, 7(D|0). A concise description of the MML 
method is presented in Oliver and Baxter (1994). 

The key differences between ML, MAP, and MML estimation techniques are as follows: 
in ML estimation, the encoding cost of parameters is, in effect, considered constant, and 
minimizing the message length corresponds to minimizing the negative log-likelihood of the 
data (the second part). In MAP based estimation, a probability density rather than the 
probability is used. It is self evident that continuous parameter values can only be stated to 

some finite precision; MML incorporates this in the framework by determining the region 

-d/2 

Q 

of uncertainty in which the parameter is located. The value of V = — , ( gives a 

v4r(®4 

measure of the volume of the region of uncertainty in which the parameter © is centered. 
This multiplied by the probability density 7(0) gives the probability of a particular 0 as 
Pr(0) = 7(0) V. This probability is used to compute the message length associated with 
encoding the continuous valued parameters (to a finite precision). 


3. The FB ,5 distribution and its parameterization 

The FB 5 distribution defined by Equation 2 comprises of three directional parameters and 
two scalar parameters. We describe the following parameterization of the distribution that is 
intuitive and relatively easy to comprehend. Let Xi = (10 0) T , X 2 = (0 1 0) T , X 3 = (0 0 1) T 
be the unit vectors along the standard coordinate axes. Let R be the rotation matrix that 
transforms the orientation axes 71 , 72 >73 supporting a FB 5 distribution to align with the 
standard coordinate axes. Then, R T = ( 71 , 72 , 73 ) = Q based on the following reasoning. 

Let a £ [0, -k\ and rj £ [0, 27t] be the co-latitude and longitude that determine the mean 
axis 71 (shown in Figure la). A clockwise rotation by an angle 7 about Xi brings 71 into 


5 



Kasarapu 


the X1X2 plane. This operation transforms the axes to 71,72 >73 respectively (Figure lb). 
A subsequent clockwise rotation by an angle a about X 3 aligns 7) with Xi. This rotation 
brings the major and minor axes into the X2X3 plane (as orthogonality should be preserved). 
In this orientation (Figure lc), let ip E [ 0 ,7r] be the angle between the transformed axis 72 
and X 2 . A clockwise rotation by ip about Xi aligns 72 with X 2 and 73 with X 3 . 




Figure 1: The series of rotations to orient 71 , 72,73 with the standard coordinate axes. The 
red dashed lines indicate the axis that is not in the first octant. For example, in 
(b), 72 is below the X2X3 plane whereas 73 is above the X2X3 plane but behind 
the X1X3 plane. 


If R, ; , R a , R,/, denote the respective rotation matrices given by 



'1 

0 

0 


cos a 

sin a 

O' 


'1 

0 

0 

R r/ = 

0 

cos 77 

sin 77 

J - 

— sin ct 

cos a 

0 

II 

13 - 

Ph 

0 

cos ip 

sin ip 


_0 

— sin 77 

cos 77 


0 

0 

1 


_0 

— sin ip 

cos ip 


then the complete rotation matrix that effects the transformation from ( 71 , 72 , 73 ) to 
(Xi, X 2 , X 3 ) is given by their product R = R^,R a R T/ . By construction, any X, = R 7 ,;, (i = 
1, 2, 3), and consequently, Q = R T . Hence, the three orthogonal axes 7 , of a FB 5 distribu¬ 
tion can effectively be described using the three angular parameters ip, a, 77 as follows: 

71 = (cos a, sin a cos 7 , sin a sin r/) T 

72 = (— cos ip sin a, cos ip cos a cos 77 — sin ip sin 77 , cos ip cos a sin 77 + sin ip cos ?7) T 
73 = (sin ip sin a, — sin ip cos a cos 77 — cos ip sin 77, — sin ip cos a sin 77 + cos ip cos ?7) T (5) 

The parameters k and f3 are interpreted as scalars controlling the concentration and 
ovalness of the distribution. Also, since the distribution has ellipse-shaped contours on the 
spherical surface, it is easier to visualize the distribution and relate k and /3 terms using 
eccentricity. Kent (1982) defined the eccentricity 1 as 2 fd/n, which is constrained to be less 

1. The definition of eccentricity in this context differs from the traditional definition of eccentricity for a 
conic section such as a parabola, an ellipse, or a hyperbola defined in the Euclidean plane. 
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than 1 (by definition), allowing correspondence between a specific Kent distribution and its 
elliptical nature. In order to better understand the interaction of k and eccentricity terms, 
we provide examples in Figure 2. 



Figure 2: An example of a FB 5 distribution with varying eccentricities for k = 10. 

For a given n = 10, an eccentricity of 0.1 results in (almost) spherical contours (Fig¬ 
ure 2a) (reminiscent of a vMF distribution); an eccentricity of 0.5 results in contours which 
are moderately eccentric (Figure 2b); an eccentricity of 0.9 further disperses the data along 
the major axis (Figure 2c). 

4. Existing methods of parameter estimation of the FB 5 distribution 

The traditional methods of maximum likelihood (ML) estimation or maximum a priori 
(MAP) based estimation require the optimization of negative log-likelihood or the posterior 
density functions respectively. They, however, don’t result in closed form solutions and 
present difficulties because of the complex form of the probability distribution. Hence, 
the widely used method of estimating the parameters of a FB 5 distribution is done using 
moment estimation. Kent (1982) formulated a procedure to obtain these estimates that may 
be subsequently used as starting points to obtain the ML or MAP estimates. Kent (1982) 
derived the moment estimates and suggested approximations based on these estimates. 

4.1 Moment estimation 

The moment estimates were proposed as an alternative to the maximum likelihood esti¬ 
mates. The approach adopted by Kent (1982) is described here: let data V = {xi,..., xjv} 
be a random sample from FB, 5 (Q, n, f3). The sample mean x and sample dispersion 3x3 
matrix S of the data are then given as: 

1 N | N 
x=-^ Xi and S = -^x, t x 7 

i=l i= 1 

Let k, /3, Q = (7i,72)7s) be the respective moment estimates of k,/3, and Q. Then the 
moment estimate 77 of the unit mean vector is obtained by normalizing x. The moment 
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estimates 7 2 and 7 3 are obtained by diagonalizing S. The matrix Q is obtained using the 
following two steps: 

1. Choose an orthogonal matrix H to rotate x to align with the Xi = (10 0) T axis 
(based on the discussion in Section 3, H = R a R^, where a and rj are the co-latitude 
and longitude of x respectively). Let B = H T SH, so that B is the dispersion matrix 
in the transformed frame of reference. 

2. The moment estimates of the major and minor axis correspond to the respective 
directions of maximum and minimum variance of the data in this transformed reference 
frame. If the angle between the direction of maximum variance and the X 2 = (0 1 0) T 
axis is ip, then a rotation defined by the orthogonal matrix K about Xi by ip, aligns 
the maximum and minimum variance directions with the X 2 and X 3 = (0 0 1) T axes 
respectively. To compute these directions, it is required to diagonalize B^, the lower 
2x2 submatrix of B. The eigenvalue decomposition of B^ gives the angle ip between 
the maximum variance direction and X 2 , which can be subsequently used to determine 
K. If the 3 X 3 dispersion matrix B = [b tJ \, 1 < i,j < 3, the expression for ip is 

b-22 ^23 

&23 bs 3 _ 

The two rotations defined by the orthogonal transformations H followed by K transform the 
axes of a FB 5 distribution to align with the standard coordinate axes. In effect, the original 
data V is transformed to V = {yi,..., y^v} such that V corresponds to a random sample 
drawn from FBs(I, k, (3), where I is the identity matrix. Hence, an inverse transformation 
of the coordinate axes yields the moment estimates Q of the axes of the FB 5 distribution. 

Further, for y = (yi, y 2 , y 3 ) T , Kent (1982) provided the moment expressions given below: 

E[yi] = c K /c, E[y 2 - yl\ = cp/c, where c = c(k,/3),c k = dc/dn,cp = dc/d/3 (7) 

For data V, if ||x|| is the magnitude of the sample mean x and l\ > I 2 are the eigenvalues 
of B^, then Kent (1982) defines the shape and size and quantities as r\ and r 2 respectively 
and are given as 


tan 2 ip = 


2623 

&22 — &33 


where B r = 


ri =E[yi] = ||x|| and r 2 = E[y 2 - y|] = h - l 2 (8) 

Hence, solving these two simultaneous equations in conjunction with Equation 7 results in 
the moment estimates k and j3. As the expressions of the partial derivatives c K and eg are 
difficult to work with, the following limiting case approximations of k and (5 are often used. 

k « (2 - 2n - r 2 )~ l + (2 - 2n + r 2 ) _1 
P ~ ^{(2 - 2n - r 2 ) _1 - (2 - 2n + r,)" 1 } (9) 

These asymptotic approximations can also be used as starting points to accurately determine 
k and f3 using an optimization library. 



MML INFERENCE OF KENT DISTRIBUTIONS 


4.2 Maximum likelihood estimation 

To obtain the maximum likelihood estimates, the negative log-likelihood function C(T> |0) 
of the data V, given by Equation 10, needs to be minimized. It is to be noted that 71 , 72,73 
are expressed in terms of Y>,ai, 7 (Equation 5), so that © = {-0, a, 7 , n, /3} is a vector of 
parameters. 


N 


£(£>|0) = N log c(k,/3) - «7 - 




X ? X„- 


72 + ^73 


i—1 



( 10 ) 


dC 

The maximum likelihood estimates are given as solutions to the equation —— = 0. These 

o© 

estimates are obtained through numerical optimization as the solution cannot be written in 
an analytical form. The optimization routine often requires some initial values of the roots. 
These starting points are taken to be the moment estimates that were discussed previously. 


5. Maximum a posteriori (MAP) based parameter estimation 

The moment estimates of a FB 5 distribution are typically used in a variety of applications 
(Peel et ah, 2001; Kent and Hamelryck, 2005; Boomsma et ah, 2006; Hamelryck et ah, 
2006). In this section, we explore MAP based parameter estimation, which we will later 
use in our discussion to compare the various estimators (see Section 9). The estimation 
procedure requires the maximization of the posterior density given some observed data V. 
If h(@) is an appropriate prior density of the parameters and Pr(D|0) is the likelihood of 
data given the parameters, then the posterior density Pr(0|D) is given as 

Pr(0|D) oc h(&) x Pr(Z>|0) 

For an independent and identically distributed sample V = {xi,..., x^r}, and a probability 

N 

distribution /(x;0), the likelihood term Pr(T>|0) = J^/(x*;0). The MAP estimator 

i— 1 

corresponds to the mode of the posterior distribution. The mode is, however, not the same 
under varying parameterizations. As a result, the MAP estimate is not invariant under 
some non-linear transformation of the parameter space (Murphy, 2012). This drawback is 
exemplified in the context of estimating the parameters of a FB 5 distribution. A prior /i© 
is described on the parameter vector 0. It is formulated based on the choice of priors for 
the individual elements of the parameter vector. We also consider its reparameterization in 
a transformed space and demonstrate that the modes of the posterior in these alternative 
parameterizations are not given by the same transformation of the parameter space. 

5.1 Prior density of the parameters 

The formulation of the prior density of the 5-parameter vector © is derived as a product of 
the priors of the three angular parameters -0, a, and two scalar parameters n, (3. Hence, the 
prior density of the complete set of parameters is given by 7©(V>, 7 , k, ft) = ^a(V’, a i 7 ) x 
h s (n,P). 
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5.1.1 Prior density (h A ) on the angular parameters 


By construction (see Section 3), the pair a,?/ uniquely defines the mean direction 71 of a 
FB 5 distribution. The mean may be considered to be uniformly distributed on the spherical 

sin q/ 

surface, and hence, its prior density is -. The angle V’ which determines the orientation 

4vr 

of the major and minor axis in a plane perpendicular to 71 is treated to be uniformly 
distributed on [ 0 , 7 r]. The joint prior of the angular parameters is, therefore, given by 
sina 

h A {'ip,a,ri) = 


5.1.2 Prior density (hs) on the scale parameters k,(3 

The prior of the concentration parameter k corresponds to the one used by Dowe et al. 
(1996b) in their analysis of vMF distributions defined on the two-sphere and is given as: 
4 K 2 

/i(/c) = — -^77. For a given k, as per the definition of a FB 5 distribution, the pa- 

7r( 1 + K Z ) Z 

rameter (3 € [0, /c/2). A uniform prior is considered for (3 within this range, that is, the 
conditional density h{(3 |/c) = 2/k. Therefore, the joint prior density of the scalar parameters 
is hs(n, f3) = (2 /k)Ii(k). The joint prior density h® is, hence, given as: 


h&(ip, a, 77 , k, (3) 


2k sin a 

7T 3 (1 + K 2 ) 2 


( 11 ) 


5.2 Non-linear transformations of the parameter space 

The reason for considering another parameterization is to show that MAP estimates are 
not invariant under non-linear transformations of the parameter space. If T(0) = &' 
denotes a transformation T on the parameter vector 0, then for invariance, the parameter 
estimates in both the parameterizations should be affected by the same transformation. 
The parameter estimate 0' in the transformed space and the estimate 0 should be related 
as T(0) = 0 7 . With the help of an example, we demonstrate that the invariance property 
is not a characteristic of MAP-based estimation, thus, making it an inconsistent estimator. 


5.2.1 An alternative parameterization involving (3 


An alternative parameterization is considered where the eccentricity e = 2(3 /k (see Sec¬ 
tion 3) is used instead of (3. This is an example of a non-linear transformation of the 
parameter (3. The prior density //,©/ (Equation 12) of the modified parameter vector 
0' = {i/>, a, 7 , k, ej is obtained by dividing the prior density h® by the Jacobian of the 
transformation given by J = de/d/3 = 2/k. The prior density h®> (after reparameteriza¬ 
tion) is: 


//©/(V’,a,? 7 ,/c, e) 


//.©(■)/, a, 7), k, (3) /c 2 sin a 

J 7T 3 (1 + K 2 ) 2 


( 12 ) 


5.2.2 Alternative forms of the posterior distribution 

Based on the definitions of prior densities in varying parameter spaces, one can estimate 
the parameters by maximizing the posterior density in the corresponding parameterization. 
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The different expressions for the posterior density are summarized here. 

N 

Posterior (©I'D) oc /i©(?/>, a, 77 , k, (3) x n /(x*;©) 

i= 1 
N 

Posterior) 0'\V) oc /?,©/(■)/, a, 77 , n, e) x ^/(xj;© 7 ) (13) 

2—1 

The expression for /(x, ©') is obtained by substituting (3 = 7ce/2 in the FB 5 probability 
density function /(x, 0) (given by Equation 2). It should be noted that the value of 
likelihood expression is the same across different parameterizations. 

5.3 An example demonstrating the effects of alternative parameterizations 

An example of estimating parameters using the various posterior distributions for a given 
dataset is shown here. A random sample of size N = 10 is generated from a FB 5 distribution 
(Kent et al., 2013). The true parameters of the distribution are {■)/, a, 77 } = ir/2 each, k = 10, 
and (3 = 2.5 (eccentricity = 0.5). To obtain the MAP estimates, the objective functions 
corresponding to the posterior density (Equation 13) need to be maximized. To solve for the 
parameter estimates, the non-linear optimization library NLopt (Johnson) in conjunction 
with derivative-free optimization (Powell, 1994) is used. Maximizing the two versions of the 
posterior density results in the following MAP estimates of i/j, a, 77 : 

ijj = 2.071, a = 1.493, 77 = 1.522 using /r© 
ijj = 2.071, a = 1.493, rj = 1.522 using /r©/ 

It is observed that the MAP estimates of ?/>, a, 77 are not different from their counterparts 
obtained using the two variations of the posterior density. However, the estimates k and 
/3 under the parameterizations 0 and © 7 do not correspond to each other as illustrated in 
the results below. 

k = 17.023, (3 = 5.493 using /?,© 
k = 20.549, e = 0.701 /3 = Ice /2 = 7.199 using /?.©/ 

Ideally, the values of fc and (3 obtained through the use of h®> should be the same as that 
obtained when the posterior density is maximized using /?,© prior. Clearly, with MAP-based 
estimation, the end results are different for the two cases. 

The modes of the posterior in the n, (3 and k, e parameterizations are shown in Fig¬ 
ure 3(a) and (b) respectively. It is expected that the modes of the posterior shift as per the 
parameter space. However, they should be invariant regardless of the transformation affect¬ 
ing the two parameter spaces. It is observed that the mode in k, e space, when mapped back 
to the k, f3 space, results in a posterior density as shown in Figure 3(c). This is different 
from the posterior density shown in Figure 3(a), as the modes are at different locations. 
We emphasize that the invariance property of parameter estimates is central to inductive 
inference. The example considered here shows that MAP estimation of the parameters of 
a FB .5 distribution does not satisfy the invariance property, thus resulting in unreliable 
estimators. 


11 



Kasarapu 



Figure 3: Heat maps depicting the modes (MAP estimate) of the posterior density as a 
function of k, f3 and n, e parameterizations. The Z-axis denotes the posterior 
density value in the respective parameterization. 


The aforementioned eccentricity transform is a straightforward transformation involving 
/ 3 . The remaining four parameters are left unchanged in this case. Another parameterization 
involving all five parameters of the FB 5 distribution is outlined in Appendix A. 

6. MML-based estimation of the parameters of the FB 5 distribution 

We now shift our focus to deriving the MML-based parameter estimates of a FB 5 distri¬ 
bution which is among the main contributions of this work. As explained in Section 2 , 
derivation of the MML estimates requires the formulation of the message length expression 
(Equation 4) for encoding some observed data using the FB 5 distribution. The formulation 
requires the use of a suitable prior density on the parameters (see Section 5.1). The prior 
for k is taken as h{n) (see Section 5.1), the prior of k for the vMF distribution on the 
two-sphere (Dowe et al., 1996b). This results in the joint prior density h®('ip,a,r), k, f3) 
(Equation 11). The main bottleneck involved in the MML-based parameter estimation is, 
however, the evaluation of the Fisher information matrix. As shown later, its computation 
involves the first and second order moments corresponding to a FB 5 distribution. The de¬ 
tails are presented here. 

Notations Before we proceed with describing the approach based on MML inference, we 
define the following notations which are used subsequently. We require the use of partial 
derivatives of the normalization constant c(a t, (3) given by Equation 3. The following are 
the adopted notations to represent them. 

c(k,/3) = c, c K = dc/dK, c /3 = dc/d(3 
c KK = d 2 c/dn 2 , cpp = d 2 c/d/3 2 , c K p = d 2 c/dKdj3 

6.1 Derivation of the moments of a general FB 5 distribution 

Kent (1982) provided the moment expressions in the case of a FB 5 distribution whose 
mean, major and minor axes are aligned with the standard coordinate axes. In this setup, 
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consider a random vector y ~ FB 5 (I, ac, /5), where I is the identity matrix. Then, Kent 
(1982) provided the following moments: 


Ai 


E[y] = (c K /c 0 0 ) T , and 
/ Ai 0 0 

0 a 2 0 
\0 0 a 3 


E[yy T ] = A = 



C/s. . C Ckk + Cp 

= 7 >A2= -7- >Aa= 2 c 


(14) 


We derive here the moments in the case of a general FB 5 distribution, that is, whose three 
mutually orthogonal axes can be oriented in any fashion. Let x FB 5 (Q, k,/3), a generic 
distribution whose axes are not aligned with the coordinate axes. Recall, from Section 3, 
that Q is the rotation matrix that aligns the standard coordinate axes with the axes of a 
FB 5 distribution. Based on the parameterization of the FB 5 distribution, we can deduce 
that Vx, 3 y ~ FB5(I,k, f3) such that x = Qy, and hence, xx T = Qyy T Q T . Using the 
results from Equation 14, we have 


E[x] = Q E[y] = (71 72 73 ) (c K /c 0 0) T = c K /c 71 
and E[xx t ] = QE[yy T ] Q T = QAQ T 


(15) 


6.2 Computation of the Fisher information 

The computation of the determinant of the Fisher information matrix requires the eval¬ 
uation of the second order partial derivatives of the negative log-likelihood function with 
respect to the parameters of the distribution. As per the density function (Equation 2), the 
negative log-likelihood of a datum x is given by 

£(x|0) = log c(k,/3) - kjJx - /? 72 x x T 7 2 + /S'yjxx 7 ^ (16) 


For the FB 5 distribution, the 5-paranreter vector © = {V>, a, r/, k, /?}. Let J-i(@) denote 
the Fisher information for a single observation. The Fisher information matrix F\ (©) 
associated with the parameters of a FB 5 distribution is a 5 x 5 symmetric matrix whose 

T d 2 C 

(i,j) th element corresponding to parameters Oi, 6j e 0 is Fogij = E 


99,00 


3 J 


Further, 


as explained later, the determinant |Jr( 0 )| is decomposed as a product of \Fa\ and l-Ts), 
where J-a is the Fisher matrix associated with the angular parameters a, 7 , and J~s is 
the Fisher matrix associated with the scale parameters n, f3. 


6.2.1 Fisher matrix {Fa) associated with 4>,a,ri 

Fa is a 3 x 3 symmetric matrix whose elements are the expected values of the second order 
partial derivatives of C with respect to 9i,9j € {ip, a,??}. Let the expectation be given as 


E 


' d 2 C ' 

9 9 id 9j 


—kT( 7i ) — /3T( 72 ) + /3T( 73 ) 


(17) 
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where the individual terms T(y m ), m G {1,2,3} are comprised of the expectations of the 
corresponding partial differentials of i m . They are computed using the following identities: 


n 71 ) 

T{ r tm) 


= E 


= E 


"5 2 (7i 


x 


dOidOj j 
’3 2 (7 T 


= E[xl 


<9 2 7 


7 T’ and 


m xx T 7 m ) 


dOidOj 


= 2 


xx 


T 


3 2 7 m 
J do, 00: 


dOidOj 
(for m = 2,3) 
dy 


+ 


OOi 


E 


xx 


Ojm. 

OOj 


(18) 


The terms T( 7 m ) depend on the expressions for the constituent first and second order 
partial differentials of 7 m , which we provide in Appendix B. Using Equations 15, 17 and 
18, the elements of the Fisher information matrix J- a are derived as follows: 


T^ = 4ficp/c; J : aip = 0 ; J 7 ^ = (cos a) 4/3 cp/c 

Taa. = K c K /c + 2/3 {(Ai - A 3 ) sin 2 ip - (Ai - A 2 ) cos 2 V’} 

= /5(1 — 3Ai)sin2'0 sin a 
= (sin 2 a) k c K /c 

A 2 (cos 2 ip cos 2 a + sin 2 ip) + (A 2 — A 3 ) cos 2 a 
—A 3 (sin 2 ip cos 2 a + cos 2 ip) + A 3 sin 2 a cos 2 ip 

6.2.2 Fisher matrix (Ts) associated with k,/3 

J~s is a 2 x 2 symmetric matrix whose elements are the expectations of the second order 
partial derivatives of C with respect to k and f3. From Equation 16, we have 




OC 

Ok 


c k t j dC op t t . t t 
— -7i x and — = — - 7 2 xx 7 2 + 73 xx 73 


d 2 C 

Ok 2 

d 2 C 

W 2 

d 2 C 

Ondp 


n2 

— T 

— ^ K 


CC/3/3 ~ Cp 


= J-pf 5 , and 






k/3 


( 20 ) 


6.2.3 Fisher matrix J 7 )©) associated with the 5-parameter vector 0 

dim 


It is to be noted that for 0, G {k, /3} and 0 3 G {ip,a,ij}, T(y m ) = 0 as 


OOi 


= 0 (lr 


given by Equation 5 are independent of k,(3). Consequently, = 0. This allows for the 
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computation of |Jt(@)| as the product of \Fa\ and iT-s). Thus, 



ifj'tjj 

F if>a 

iprj 

0 

0 



OLXjj 

act 

a?? 

0 

0 


l^l(0)l = 

rjxp 

qa 


0 

0 



0 

0 

0 

3F kk 

■Fk(s 



0 

0 

0 

Fpn 

Fpp 



Then, the Fisher information for some observed data V = {xi,..., x^r} is given by |.F(©)| = 
A r5 |J r i(@)| (as each element in |J r i(0)| is multiplied by the sample size N). 

6.3 Message length formulation 

The message length to encode some observed data V can now be formulated by substituting 
the prior density h<~> (Equation 11), the Fisher information |.F(0)| and the negative log- 
likelihood of the data (Equation 10) in the message length expression (Equation 4). The 
MML parameter estimates are the ones that minimize the entire message length. As there 
is no analytical form of the MML estimates, the solution is obtained, as for the maximum 
likelihood and MAP case, by using the NLopt 2 optimization library (Johnson) . At each 
stage of the optimization routine, the Fisher information needs to be calculated. However, 
this involves the computation of complex entities such as the normalization constant c(k, f3) 
and its partial derivatives. The computation of these intricate mathematical forms using 
numerical methods is discussed in Section 7. 


7. Computation of the normalization constant and the associated 
derivatives 

The computation of the negative log-likelihood function and the message length is hindered 
because of the presence of the normalization constant and its associated derivatives. Kent 
(1982) provided an asymptotic formula for c(k, (3) as 27rexp(«)[(K 2 — 4/3 2 )] -1 / 2 . However, 
this approximation is valid for large k and when 2/3/k is sufficiently small. In this section, we 
describe the methods that can be employed to efficiently compute these complex functions 
without making any assumptions. 


7.1 Computing logc(/c,/3) and the logarithm of the derivatives: c K = dc/dn 
and c KK = d 2 c/&K 2 

The expressions of c,c K ,c KK are related, to each other. These are explained by defining 

the quantity s[ m \ a logarithm sum where m G {0,1,2}, p = 2j + -, Ai = 2ir\ — , and 

Z V K> 

2/3 

e = — <1 (by definition). 


‘Si m) = lo g^i + lo § p(j + i) e2J/pfro (”) 


i=o 


( 21 ) 


fj 


2. http://ab-initio.mit.edu/nlopt 
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Computation of the series S'{ m ' 1 : We first establish that fj +1 < fj V) > 0 and show that 
converges to a finite sum as j —> oo. Consider the logarithm of the ratio of consecutive 
terms /j and /j+i in s["‘\ 

log = log J ^-j + 2 loge + log 1 p+"‘+ 2 ^ (22) 

fj 3 + 1 Jp+m(«) 

For p, v > 0, / p+ „ < Ip, and the ratio I - E p }L —> 0 for large v (Amos, 1974). Further, e < 1 

implies the above equation is the sum of negative terms. Hence, log 4±1 < 0, which means 

Jj 

fj +1 < fj- Also, 

Aii ^2 i + 1+2( K ) 

lim log -A— =0 + 2 log e + lim log —— = —oo 

oo fj j->o o I 2 j+i(«) 

Hence, as lim — = 0, 5{ m ' 1 is a convergent series. 
j^°o fj 

For practical implementation of the sum, we express S'[ ,rd as the modified summation, 

OO 

Sj m) = log <5i + log / 0 + log ^ tj (23) 

i=o 

where each fj is divided by the maximum term ,/q. For each j > 0, log fj is calculated using 
the previous term log /)_i (Equation 22). The new term tj = fj/fo is then computed 3 
as exp(log fj — log/o) (computing the difference with the maximum value and then taking 
the exponent ensures numerical stability). The summation is terminated when the ratio 

—- < e (a small threshold ~ 10 -6 ). 

EL 14 

• Let 5(c) = log c(k,P): Substituting m = 0 in Equation 21 gives the logarithm of the 
normalization constant (given in Equation 3). Hence, 5(c) = 5j°\ 

• Let the j th term dependent on k in Equation 3 be represented as gj(n) = I p / k p , where 
I p implicitly refers to I p (n). We use the relationship between the Bessel functions 
Ip, Ip- 1 , and the derivative I' p in Equation 24 (Abramowitz and Stegun, 1965), to 
derive the expressions for the first and second derivatives of gj{n) (Equation 25). 

kI' p = kI p -i -pip (24) 

s'jD = and LL) = I ~ e y l + ( 25 ) 

J kP j kP k kP 

Let S(c K ) = logc K : Because of the similar forms of gj(n) and g'j(n), the expression 
for S(c K ) will be similar to 5(c) with a change in order of the Bessel functions from 
m = 0 in Equation 21 to m = 1. Hence, S(c K ) = 5| 1J and an expression akin to 
Equation 23 can be derived for 5(c K ). 

3. Because of the nature of Bessel functions, log fj can get very large and can result in overflow when 

calculating the exponent exp(logD. However, dividing by fo results in fj/fo < 1. 
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Let S(c KK ) = logc KK : Substituting m = 2 in Equation 21 gives the logarithm sum 

S', corresponding to the series with terms p+2 . Based on the nature of g'- (k) 

(Equation 25), and noting that S(c K ) > S\~ (as I p + \ > I p +2 Vp > 0), we can formulate 
S(c KK ) as given below. 


S(c KK ) = S(c K ) + log ( exp(S'| 2) - S(c K )) + - 


7.2 


The logarithm of the derivatives: op = dc/d(3 , c K p = d 2 c/dK,d(3, and 
C/ 3/3 = d 2 c/d(3 2 


The expressions of eg and c K g are related and are explained using the logarithm sum S^ 


(n) 


47T 


2/3 


where n £ {0,1}, 82 = — \f — , p = 2 j + and e = — 

P 



log <y 2 + log 



rq+j) 

r(j) 


e 3/p_[_ n (/^) 

--' 

fj 


(26) 


We note that S^ is a convergent series (proof is based on the same reasoning as in Sec¬ 
tion 7.1). 


Let the j th term dependent on /3, n in Equation 3 be represented as k) = /3 2jl -^. 


(n) 

Its partial derivatives are given below. These derivatives are the terms in the series ' 
(after factoring out the common elements as 82 )- 


% 

d/3 


2 j /3 2j_ 1 — and 


d 2 ffj 

dndj3 


2j/3 2j_1 


3p+i 


• Let 5(c^) = logcg: this is obtained by substituting n = 0 in Equation 26. Hence, 

S( C/3 ) = sf\ 

• Similarly, S(c K( g) = logc K/3 = 

• The expression to compute S(cpp) = log cpp is given by 

s(cpp) = log + log F r ~ ^ e2JIp W 

J = 1 V----' 

fj 


The practical implementation of sip and S(cpp) is similar to that of s[ m ^ given by Equa¬ 
tion 23. However, in these cases, the expressions of fj and consequently tj , are modified 
accordingly. Also, the series begin from j = 1, and hence, the maximum terms will corre¬ 
spond to / 1 . 
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8 . Mixture modelling of FB 5 distributions 


In this section, we provide an overview of the mixture modelling apparatus in the context 
of modelling directional data using FB 5 distributions. The probability distribution of a 
mixture At is of the form: 

I< 


/(x; <S>) = ^2 0 i) 

3 =1 

where K is the number of component FB 5 distributions, Wj > 0 is the component weight 
such that J2f= 1 w j = 1> an d ®j denotes the 5-paranreter vector of the j th FB 5 distribution. 
The parameters of the mixture are collectively given by $ = {tci, • • • , wk, © 1 , • • • , ®k}- 


8.1 Estimating the mixture parameters 

For a mixture with I\ number of components, the traditional method of estimating the 
mixture parameters is done by minimizing the negative log-likelihood function of the data 
given by 

N K 

£(£>|$) = -J^log (27) 

i =1 3 =1 

where V = {xi,..., x^v} is the observed data of size N. The maximum likelihood estimation 
procedure, in this case, involves an expectation-maximization (EM) algorithm (Dempster 
et ah, 1977; Krishnan and McLachlan, 1997) which is decomposed into the following steps: 

• Expectation (E-step): The membership of each datum x,; G 77 in a mixture component 
j G {1 ,K} is updated as: 


nj = 


tt'ifixj: (“);) 
Efc=l w kf{^u ®k)' 


N 

and nj = ^2 r'ij 
i= 1 


where the table of memberships is termed the responsibility matrix and rij is the 
effective membership of the j th component. 


• Maximization (M-step): The parameters of each component are updated by their 
respective maximum likelihood estimates. These are obtained by minimizing C(T> |3>). 
Differentiating Equation 27 with respect to ©j leads to the following modified form 





(28) 


where 0j = {tpj, ctj, rjj, Kj, f3j} and 7 ^-, 72 ^, 7 . 3 ^ are functions of ipj , a.j , r)j (Equation5). 
The above equation resembles the negative log-likelihood function due to a single 
FB 5 component (Equation 10) after accounting for the partial memberships of data 
within that component. Minimizing Equation 28 yields the maximum likelihood esti¬ 
mate of 0j. The component weights are updated as Wj = nj/N. 
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8.2 Estimating the mixture parameters using the MML framework 

The seminal work on minimum message length inference of mixture models was carried out 
by Wallace and Boulton (1968). As per the MML framework (Section 2), the estimation 
of parameters of a mixture distribution requires the encoding of the parameters and the 
data given those parameters. The resultant total message length expression needs to be 
minimized to obtain the MML estimates. The formulation of a mixture modelling problem 
using MML framework can be decomposed into: 

1. First part: encoding the mixture parameters 3>, namely, number of components K, 
mixture weights w = {uq,... ,iok}, and the component parameters &j. 

2. Second part: encoding the data V given the parameters <J>. 

The schemes for encoding K and w are generic (Wallace, 2005) and are summarized in 
Kasarapu and Allison (2015). However, encoding the component parameters requires 
the evaluation of the corresponding Fisher information. Using the appropriate encoding 
schemes, the general form of the total message length expression provided by Wallace and 
Freeman (1987) is: 


I< 

/($,£>) = I{K) + /(w) + V I(Bj)+£(V\&) + constant (29) 

v --- ' 

v J y second part:/(D|<I?) 

first part: /(3>) 


where I(K) and /( w) are the message lengths to encode K and w respectively. The 
cumulative Fisher information of mixtures for the components’ parameters is given by the 

h(Qj 


K 


summation , where I(&j 

j =i 


= - log 


V\^jT\ 


is the message length to encode the 


parameters of the j th component (Section 2.2). The second part of the message I(T> |<F) is 
a measure of the goodness of fit to the data and corresponds to the negative log-likelihood 
(Equation 27). 

To obtain the MML estimates, an EM algorithm is employed to minimize the two-part 
message length /(<!>, 77). In the E-step, the memberships of the data are updated, while in 
the M-step, the component parameters are updated using their respective MML estimates 
(Section 6.3). The EM algorithm is continued until there is no change in message length, 
that is, when the algorithm converges to a local minimum. 


8.3 Determining the optimal number of mixture components 

The parameters of a mixture can be estimated once the number of mixture components 
are known. A mixture modelling problem also needs to address the issue of selection of 
optimal number of components. The EM algorithms that are used in the estimation of 
mixture parameters are carried out with a fixed number of components. As the number of 
mixture components K increases, the negative log-likelihood (Equation 27) decreases and 
consequently results in an improvement to the quality of fit to the data. However, increasing 
K results in the mixtures becoming overly complex. Thus, a reliable tradeoff in terms of 
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balancing the model complexity and the quality of fit should be achieved. There are two 
aspects concerning the determination of suitable number of mixture components: 

1. a scoring function to evaluate a given mixture 

2. a search strategy to infer such a mixture 

Scoring function: There have been numerous scoring functions proposed in the literature 
that aim to evaluate a given mixture model. A review of these methods is presented in 
McLachlan and Peel (2000). The common motivation is to balance the model complexity 
and the quality of fit. The scoring functions which quantify the model complexity based on 
the number of components are Akaike Information Criterion (AIC) (Akaike, 1974), Bayesian 
Information Criterion (BIC) (Schwarz, 1978; Rissanen, 1978), and Integrated Completed 
Likelihood (ICL) criterion (Biernacki et ah, 2000). It is to be noted that AIC and BIC are 
shown to be approximations of the general MML framework (Figueiredo and Jain, 2002). 
The information-theoretic criteria that account for not just the number of components but 
also the components’ parameters are ICOMP (Bozdogan, 1993), Laplace empirical crite¬ 
rion (LEC) (Roberts et al., 1998), and approximated MML criterion (Oliver et ah, 1996; 
Figueiredo and Jain, 2002). Further, these criteria are derived using a MML interpreta¬ 
tion. However, as detailed in Kasarapu and Allison (2015), these criteria are oversimplified 
versions of the generic MML framework and are incomplete in objectively addressing the 
tradeoff associated with selecting a suitable mixture model. 

Search strategy: To determine the optimal number of mixture components using the afore¬ 
mentioned criteria (Akaike, 1974; Schwarz, 1978; Oliver et ah, 1996; Roberts et ah, 1998; 
Biernacki et ah, 2000), mixtures are inferred for varying number of components using the 
EM algorithm, and the mixture that has the least score is treated as the optimal one. As 
the EM only guarantees convergence to a local optimum, a few trials are conducted with 
different starting points in an effort to minimize the possibility of getting trapped in a local 
optimum (Krishnan and McLachlan, 1997; McLachlan and Peel, 2000). In order to rectify 
the issues arising from the use of EM method which plays a central role in identifying the 
right mixture model, methods based on iteratively splitting and merging constituent mix¬ 
ture components have been proposed so as to enable the intermediate mixtures to escape 
from local optima. The notable amongst these are split-merge based EM (SMEM) method 
proposed by Ueda et ah (2000) and component-deletion based unsupervised learning ap¬ 
proach proposed by Figueiredo and Jain (2002). 

Given a mixture with K components, the SMEM method selects the top three candi¬ 
dates, merges two of them, and splits the other into two, thus, leaving the effective number 
of components unchanged. Further, the potential candidates are chosen depending on the 
improvement to the complete data log-likelihood function (used to formulate the ICL crite¬ 
rion). In contrast, the method of Figueiredo and Jain (2002) starts off by assuming a large 
number of components and iteratively eliminates those that are deemed redundant as per 
their objective function (a simplified MML-like formulation). Figueiredo and Jain (2002) 
demonstrated their competitive edge against the contemporary BIC (Schwarz, 1978), LEC 
(Roberts et ah, 1998), and ICL criterion (Biernacki et ah, 2000). 

The SMEM algorithm does not facilitate an increase or decrease in the mixture size. In 
contrast, the method of Figueiredo and Jain (2002) progressively reduces the mixture size, 
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and hence, has no provision for recovering a component if it is deleted by chance. Also, the 
assumptions made in formulating their MML-like scoring function lack the rigor to objec¬ 
tively weigh the mixture model complexity against the quality of data fit, as explained in 
Kasarapu and Allison (2015). In order to address the limitations resulting from approxi¬ 
mating the scoring functions and the search strategies, more recently, Kasarapu and Allison 
(2015) proposed a search heuristic in conjunction with a comprehensive MML formulation 
(with no approximations) to infer a suitable mixture model. This was demonstrated in 
the context of inference of mixtures of multivariate Gaussian and vMF distributions. In 
our previous work (Kasarapu and Allison, 2015), we have established that the proposed 
approach outperforms the widely used method of Figueiredo and Jain (2002). 

8.4 The optimal number of components of a FBs-component mixture 

We briefly review the search method of Kasarapu and Allison (2015) here that extended the 
MML-based Snob program (Wallace and Boulton, 1968; Wallace, 1986) for unsupervised 
learning. The method is adapted to the present scenario of mixture modelling of FB 5 dis¬ 
tributions. The general idea is to perform a series of perturbations (split, delete, and merge 
operations) to a current sub-optimal mixture to obtain an improved mixture with a lower 
message length. The method begins by assuming a one-component mixture. The mixture 
is split into two children which are locally optimized. If the resultant mixture has a lower 
message length, the current mixture is updated. If, at any stage, a mixture has K compo¬ 
nents, each component is separately split into two, deleted, and merged with an appropriate 
component. The split operation results in a (K + l)-component mixture, while the delete 
and merge operations result in (K — l)-component mixtures. Each of the intermediate mix¬ 
tures are optimized using an EM algorithm (Section 8.2). The perturbation corresponding 
to a component that results in the greatest reduction in message length is considered. This 
heuristic exhaustively considers all possible operations giving the JF-component mixture the 
best chance to escape from a sub-optimal state. The method terminates when none of the 
perturbations result in improved mixtures. Each of these operations are explained below in 
the context of FB 5 distributions. 

8.4.1 Splitting a component 

The split operation is critical as it leads to mixtures with greater number of components. It 
is not desirable to have overly complex mixtures unless required. While splitting a (parent) 
component, the initial means of the two children should be reasonably apart so that they 
form distinct components and the A"-component mixture has the best chance to move from 
a sub-optimal state to a more optimal state (if one exists). After the inital means are cho¬ 
sen, an EM is carried out just on the two child components until they are stabilized, leaving 
the remaining (K — l)-components unchanged. After optimizing the children, they are then 
integrated with the original K — 1 components and an EM is subsequently performed on 
the K + 1 components to reach an optimal state. If the new (K + l)-component mixture 
results in a lower total message length, that implies the perturbation of the JF-comp orient 
mixture resulted in an improved mixture. 
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Selection of initial means of the two child components: In the case of Gaussian distributions, 
Kasarapu and Allison (2015) chose the initial means such that they are one standard devi¬ 
ation away on either side of the component mean along the direction of maximum variance. 
In the present work, for directional distributions (vMF and FB 5 ) defined on the three- 
dimensional spherical surface, we provide an analogous form. As described in Section 4.1, 
the procedure for moment estimation of the major and minor axes of a FB 5 distribution 
involves the eigenvalue decomposition of the matrix B^, the submatrix derived from the 
dispersion matrix B. If Zi and I 2 are the eigenvalues of B^ (Equation 6 ), then I 1 J 2 are 
roots of the characteristic equation: 

l 2 ~ if>22 + &33^ + &22^33 “ ^23 = 0 so that l\ + l 2 = ^22 + b 33 

According to Equation 8 , we have h — I 2 = r 2 - and hence, l\ = (622 + ^33 + r *2 )/2. The 
maximum variance is along the direction of major axis and is equal to the eigenvalue l\ . 
Hence, one standard deviation would correspond to y/lf. It is to be noted that these 
calculations are done in the X 2 X 3 plane which contains the major and minor axes (that is, 
after the mean of the parent, as part of moment estimation, is aligned with Xi). However, 
it is now required to map this point back onto the unit sphere. 

Consider Figure 4(a) where 72 and 73 are the major and minor axes in the X 2 X 3 plane 
respectively. The mean axis 7 ^ of the parent component being split is aligned with Xi. 
The segment OP is of length y/lf corresponding to unit standard deviation along Let 
M\ be the mean of one of the children. Then, for M 1 such that M\P is perpendicular 
to the X 2 X 3 plane, we have M\P = \/l — l\ (as OM\ = 1 is the radius of the sphere). If 
6 6 [0,180°] measures the co-latitude of the mean M\ as shown, we have 9 = arccos y/1 — If. 
The mean M 2 (not shown in the figure) of the second child component lies in the plane 
containing OM\P such that the angle between OM 2 and OX 1 is 9. The two means are then 
transformed in order to conform with the axes of the parent FB 5 component. With these 
as starting points for the EM algorithm, the two child components are locally optimized. 
The children along with the untouched (K — l)-components serve as a starting point for 
estimating the parameters of the (K + l)-component mixture using the EM algorithm. 

8.4.2 Deletion of a component 

While deleting a component, its memberships are adjusted by proportionally distributing 
among the remaining K — 1 components. With this new starting point, the parameters of 
the ( K — l)-component mixture are estimated using an EM algorithm. 

8.4.3 Merging two components 

The choice of merging a pair of components is determined by their closeness. To identify 
the closest component, Kasarapu and Allison (2015) compute the Kullback-Leibler (KL) 
divergence (Kullback and Leibler, 1951) of the component in consideration with the re¬ 
maining K — 1 components in the mixture. The chosen pair is then merged to form a 
single component whose initial weight and memberships are given by the sum of the indi¬ 
vidual components’ weights and memberships respectively. This acts as the starting point 
for the EM algorithm to estimate the parameters of the merged {K— l)-component mixture. 
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Kullback-Leibler divergence of FB§ distributions: The analytical form of the KL divergence 
between two FB 5 distributions is derived below. The KL divergence between two probability 
distributions f a and fb is defined as 


D KL (fa\\fb)=^a 



/a( x ) ~ 

/&( x ). 


where E a [.] is the expectation of the quantity [.] using f a . 

Let f a (x) = FB 5 (K a , fa, Q a ) and fb (x) = FB^fab, fa, Q 5 ) be two distributions such that 
Qa = (7ai,7a2,7a3) and Q b = (761,762,763)- Let c a and c b be the respective normalization 
constants. Then, 


E 


a 



fa fa) ' 

fbfa). 


= log — + faalll ~ Kblbl) E « [ x ] 

C a 

+ fa 7 J 2 E a [ xx;T ] 7a2 - fa 762 [ xxT ] 762 


- fa 7 J 3 E q t xxT ] 7a3 + fa 763 E a [ xxT ] 763 


(30) 


gives the analytical form of the KL divergence of two FB 5 distributions. The expressions 
for E 0 [x] and E (l [xx T ] are derived in Equation 15. 




(a) 


(b) 


Figure 4: (a) Selection of initial means of children during splitting a parent component (see 
Section 8.4.1). (b) Transformation between spherical and Cartesian coordinates. 


Through these perturbations, the search method aims to find an optimal state by lever¬ 
aging information about the sub-optimal state. In doing so, we are cautiously splitting, 
deleting, or merging potential candidate components. The heuristic attempts to find im¬ 
proved mixtures without compromising the optimality of the intermediate solution. 
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8.5 An illustrative example of the search procedure 

The mechanics of the inference of a suitable FB 5 mixture model has been explained pre¬ 
viously in Section 8.4. For further details of the search method, we refer the reader 
to Kasarapu and Allison (2015). To better illustrate the search process, this subsection 
presents a detailed example. 

Consider a mixture with three FB 5 components (Figure 5) that have equal mixing 
proportions, the same concentration parameter k = 100 and different eccentricities. The 
red component has eccentricity e = 0.1 and the angular parameters defining its axes are 
(ip, a, if) = (0,60°, 45°). The green component has e = 0.5 and ( ip,a,r /) = (150°, 45°, 30°). 
The blue component has e = 0.9 and (ip,a,r/) = (30°, 45°, 60°). The parameters are chosen 
such that the components are close to each other. A sample of size N = 1000 was generated 
from the mixture using the method of Kent et al. (2013). The mixture density is shown as 



Figure 5: Original mixture consisting of three components with equal weights and n = 100. 

(a) individual components with varying eccentricities: e = 0.1 (red), e = 0.5 
(green), and e = 0.9 (blue) (b) simulated data plotted in degrees in the 0cp space 
(contours encompass 90% of the data), 


a heat map in Figure 5(b). For ease of visualization, the density is represented in 9(p space, 
where 6 is the co-latitude and (p is the longitude (Figure 4(b)). The Cartesian coordinates 
of each datum x = (x\, X 2 , X 3 ) in the sampled data are transformed into the spherical 
coordinates defined by unit radius, co-latitude, and longitude. The transformation 4 is 
effected by: 

x\ = cos#, X ‘2 = sin# cos (p, £3 = sin# sin^ 


4. It is to be noted that transforming data generated from a FB 5 distribution (that has elliptical contours on 
the spherical surface) and representing in the 9<f> space produces shapes that do not have any decipherable 
pattern as can be seen through this example. 
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8.5.1 The seach method explained 

The search begins by inferring a one-component mixture M 1 (Figure 6 a). It has an asso¬ 
ciated message length of I = 19364 bits. Before splitting the component, the means of the 
children are initialized as shown in Figure 6 (b). These means are determined as explained 
in Section 8.4.1. The children are optimized using the EM algorithm to generate the two- 
component mixture M 2 (Figure 6 c). M 2 has a message length of I = 19319 bits, and hence 
improves M\ by 45 bits. 




(a) Mi (I = 19364 bits) (b) Initialization of means (c) M 2 (I = 19319 bits) 


Figure 6 : Iteration 1 (a) one-component mixture, (b) Red colour denotes the parent compo¬ 
nent being split, the red dot indicates the mean of the parent and the black dots 
(on either side) indicate the initial means of the children, (c) improved mixture. 


In the second iteration, each of the two components in M 2 are split, deleted, and 
merged. Figure 7(a)-(c) illustrates the splitting of component P\. After integrating the 
optimized children and subsequently optimizing the resulting 3-component mixture using 
an EM algorithm, an improved mixture M 3 is obtained. Figure 7(d)-(f) illustrates the 
splitting of component P 2 ■ In this case, splitting Po results in the same 3-component 
mixture M 3 . It is to be noted that while splitting Pi and P 2 produce different intermediate 
states, as shown in Figure 7(b) and (e), the EM converges to the same optimal state in these 
cases. Figure 8 (a)-(f) illustrate the deletion of Pi and P 2 . While their deletions also have 
different intermediate starting points, as shown in Figure 8 (b) and (e), the EM algorithm 
results in the same sub-optimal state (same as M\). As this one-component mixture has a 
greater message length than that of M 2 , the deletion operations do not result in improved 
mixtures. The merging of Pi and P 2 components, as shown in Figure 8 (g)-(i), also does 
not improve on M 2 - Hence, after the second iteration, it is observed that amongst all 
perturbations, the splitting of Pi or Po results in an improved mixture M 3 . 

In the third iteration, all perturbations are carried out exhaustively. Figure 9 depicts 
the splitting, deletion, and merging of one of the three components (Pi) in M 3 . During 
splitting, observe the initial selection of means of the child components. The procedure 
outlined in Section 8.4.1 faithfully separates the two children and results in a mixture 
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(d) Splitting P 2 (7 = 19319) (e) Optimized children (7 = 19333) (f) M 3 : post-EM (7 = 19143) 


Figure 7: Iteration 2 - Split operations, (a)-(c) splitting the first component P\ in A4o, 
and (d)-(f) splitting the second component Po in Ad 2. The red dashed lines in 
(b),(e) represent the optimized children (prior to integration), and black dashed 
lines in (b),(e) represent the unchanged components. 


with a greater number of components. However, in this case, the optimized mixture AI4 
(Figure 9c) does not improve the message length. Similarly, the deletion of Pi does not lead 
to an improved mixture (Figure 9f). While merging Pi, KL divergence is used to determine 
an appropriate candidate that is closest. Accordingly, the pair is selected (Figure 9g) which 
also does not result in an improved mixture (Figure 9i). The other two components in AI3 
are also perturbed similarly. However, the operations do not result in an improvement (the 
series of steps and the resulting mixtures are included in Appendix C). 

8.5.2 Variation of the two-part message length 

Let us now explain the evolution of the mixture model in terms of the two-part message 
length (Equation 29). While increasing the number of mixture components leads to in¬ 
creased mixture complexity, the fit to the data improves. The first part of the message 
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(a) Deleting Pi (7 = 19319) (b) Before optimizing (7 = 23019) 


(c) post-EM (7 = 19364) 



(d) Deleting P 2 (7 = 19319) 


(e) Before optimizing (7 = 20416) 


(f) post-EM (7 = 19364) 



(g) Merging Pi and P 2 (7 = 19319) (h) Before optimizing (7 = 19364) (i) post-EM (7 = 19364) 


Figure 8: Iteration 2 - Deletions and Merging (green colour represents the component being 
deleted and blue denotes the pair being merged). 


corresponds to the overhead related to encoding the mixture parameters (number of com¬ 
ponents, weights, and constituent components’ parameters). The second part mainly cor- 
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(a) Initialize means of children 
90 r 


(b) Optimized children (7 = 19186) 
90 


(c) Mi : post-EM (7 = 19174) 
90 



(d) 


(e) Before optimizing (7 = 20741) (f) post-EM (7 = 19320) 



Figure 9: Iteration 3 - perturbations of Pi (a)-(c) splitting, (d)-(f) deletion, (g)-(i) merging 


responds to the negative log-likelihood of the data using a given mixture model. In the 
previous example, the search method infers three components and terminates thereafter. 
The message lengths corresponding to the optimal mixtures during the associated search 
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process are plotted in Figure 10. It is observed that, until K = 3, the total message length 
(green curve) decreases. We wanted to examine the variation of the message length be¬ 
yond the inferred number of components. For this, starting from K = 4 until K = 10, we 
estimated the mixture parameters using the EM algorithm (Section 8.2) for each value of 
K > 3. The results indicate that the total message length steadily increases beyond K = 3. 
The reason is that although the negative log-likelihood of the data decreases (with increas¬ 
ing K), the second part of the message (blue curve) only changes marginally, while the first 
part continually increases. Thus, as mixtures become overly complex, there is a greater 
cost associated with encoding their parameters. This affects the total message length as the 
minimal gain in negative log-likelihood is overshadowed by the increase in the first part of 
the message. Hence, this example demonstrates the effectiveness of the search method in 
the context of FB 5 distributions. Furthermore, it also demonstrates the ability of the MML 
criterion to balance the tradeoff between the model complexity and the quality of data fit. 



Figure 10: Variation of the individual parts of the total message length with increasing 
number of components (note that the two Y-axes have different scales: the first 
part of the message follows the right side Y-axis; while the second part of the 
message and total message lengths follow the left side Y-axis) 


9. Experimental analyses of the various parameter estimates 

For a given FB 5 distribution characterized by concentration k and eccentricity e, a random 
sample of size N is generated using the method proposed by Kent et al. (2013). We set the 
true distribution to have {ip,a,r]} = it/2 each. The scale parameters k and e are varied to 
obtain different FB 5 distributions and corresponding random samples. The parameters are 
estimated using the sampled data and the different estimation methods. The procedure is 
repeated 1000 times for each combination of N, k, and e. 


29 










Kasarapu 


9.1 Methods of comparison 

We conduct a comparison between the moment, maximum likelihood (ML), MAP, and 
MML-based estimates. The results include the two versions of MAP estimates resulting 
from the two forms of the posterior distributions (Equation 13): MAPI corresponds to 
the posterior with parameterization k, /3, and MAP2 corresponds to the posterior with 
parameterization k, e. The two versions are considered so as to show that MAP estimates 
are inconsistent and are dependent on the parameterization used. 

The MML estimates are obtained by minimizing the message length expression. Natu¬ 
rally, the estimates due to other methods do not result in lower message lengths. Similarly, 
if we use the negative log-likelihood as the comparison criterion, the maximum likelihood 
estimates have a lower value compared to the other estimates. As each estimation tech¬ 
nique optimizes a different objective function, it is required to have a metric that impartially 
evaluates the different estimates. The mean squared error of the estimates and Kullback- 
Leibler (KL) divergence (Kullback and Leibler, 1951) are therefore used to compare the 
various estimates. The estimates are also compared using statistical hypothesis testing. 

9.1.1 Mean squared error of the estimates 

For a parameter vector 0, and its estimate 0, the mean squared error (MSE) is given by 
E[(© — ©) 2 ]. Further, the MSE can be decomposed into bias and variance terms, as given 
below (Lebanon, 2010; Taboga, 2012). 

E[(@ — 0) 2 ] = Bias 2 (0) + trace(Var(©)) 

where Bias 2 (@) = ||E[0] — 0 |[ 2 and Var(0) is the covariance matrix of the estimator. 
Ideally, it is expected that the estimates result in low MSE values and it depends on the 
bias and variance of the parameter estimates. An estimate that results in lower values of 
MSE is usually preferred over the other estimates. 

9.1.2 Kullback-Leibler (KL) DIVERGENCE of the estimated distribution 

The KL divergence is a similarity measure that is used to determine the “distance” between 
the true distribution and the distribution thats uses the estimated parameters. An estimate 
that results in lower KL divergence is considered a better estimate. The analytical form of 
the KL divergence between two FB 5 distributions is derived in Section 8.4.3. We report the 
percentage of times (out of 1000 random simulations) that the KL divergence of a particular 
estimator is lower than that of others. An estimator wins when its associated KL divergence 
is less than that of the other estimates. 

When the KL divergence of different estimates is compared, because of two different 
versions of MAP estimation, we present two separate frequency plots. The KL divergence 
of the moment, ML, and MML estimates is contrasted with KL divergence of MAPI or 
MAP2 estimates. 

9.1.3 Statistical hypothesis testing 

The likelihood ratio test is typically used to determine the suitability of modelling data V 
using a simpler or a nested model, corresponding to one with fewer free parameters (null hy¬ 
pothesis Hq) against a more general model (alternate hypothesis Pa)- The likelihood ratio 
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A is used to determine the preference of a null hypothesis Ho over an alternate hypothesis 
Ha, and is as follows: 

max likelihood (V\Ho) 

^ _ Ho _ 

max likelihood(P|'HA) 

Ha 

The test statistic resulting from the use of A is related to the negative logarithm of the likeli¬ 
hood ratio and is given by A = —2 log A. The distribution of the statistic A is asymptotically 
approximated as a x 2 distribution with degrees of freedom equal to the difference in the 
number of free parameters between the alternate and the null hypothesis (Wilks, 1938). If 
A is sufficiently small, it would lead to a rejection of the null hypothesis. Conversely, if A 
exceeds some confidence threshold, Ho is rejected. 

In the current analysis of the various parameter estimates, we compare the likelihood 
ratio resulting from the use of a particular estimate 0 (that is, moment, MAP or MML- 
based) against a general FB 5 distribution. It is equivalent to testing the null hypothesis 
Ho : © = © (explicit parameters) against the alternate hypothesis Ha ■ © / © (with 5 
free parameters). Assuming a statistical significance of the test as 1 %, Ho is rejected when 
A > r, where r = 13.086 corresponds to the 99 th percentile of a x 2 distribution with 5 
degrees of freedom. Alternatively, the test statistic can be used to evaluate the p-value, 
which if less than 1 % (significance of the test) amounts to rejection of Ho- 

For the various parameter estimates compared here, it is expected that at especially large 
sample sizes, the estimates are close to the maximum likelihood estimate as determined by 
the corresponding test statistic. In other words, the empirically determined test statistic is 
expected to be lower than the critical value r, which implies it has a corresponding p-value 
greater than 0 . 01 . 

9.2 Empirical analysis 

The estimates are analyzed here in two controlled cases: (1) fixing sample size N with 
varying k and e, and (2) varying N while fixing k and e. 

9.2.1 Fixed sample size, varying concentration k and eccentricity e 

The results are presented when a random sample of size N = 10 is generated from the 
FB 5 distribution for a k that is increased by an order of magnitude starting from 1 to 100. 
The behaviour of the estimates is analyzed below. 

• k = 1: The performance of the various estimates using the comparison methodologies 
(Section 9.1) is illustrated in Figure 11. It is observed that the bias and MSE of 
moment and ML estimates is greater than that of MAP and MML-baserl estimates. 
The two versions of the MAP estimates also have a greater bias and MSE as compared 
to the MML estimates shown in Figure 11(a) and (b). 

It is also observed that the MML-based estimates result in lower KL divergence more 
than 80% of time as compared to other estimates when MAPI is used (see Figure 11c). 
With MAP2, the frequency of wins for the MML-based estimates increases to more 
than 90% (see Figure lid). This suggests that transforming the parameter space 
greatly impacts the MAP-based estimates. The ML estimates win less than 5% of 
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the time. This is in agreement with the relatively greater MSE observed for the ML 
estimates. 

The boxplots shown in Figure 11(e) and (f) show the variation of the test statistics 
and the corresponding p-values. There is a greater variation for the MML-based 
estimates. However, across all values of eccentricity, the test statistic A is less than 
the threshold t = 13.086 and the smallest p-value is greater than 0.01. This is true 
across all estimation methods, thus, suggesting that the null hypothesis of modelling 
data using a particular estimate (moment, MAP or MML) is accepted at the 1% 
significance level. 

• k = 10: The comparison results are presented in Figure 12. Similar to the previous 
case (ft = 1), the moment and ML estimates have greater bias and MSE. It is inter¬ 
esting to note that MAP2 has greater bias and MSE compared to MML estimates 
(Figure 12(a) and (b) respectively). However, MAPI estimates are in close competi¬ 
tion with the MML. The bias and MSE are lower for MML estimates until e < 0.5 
and greater compared to MAPI estimates for e > 0.5. 

The number of times KL divergence is lower for the MML estimates decreases with 
increasing eccentricity (for both versions of MAP estimates). For e < 0.5, the percent¬ 
age of wins for the MML estimates is greater than all other estimates. However, for 
e > 0.5, MAPI wins majority of the time (Figure 12c). In the case of comparison with 
MAP2, the percentage of wins of MML estimates continuously decreases. However, 
the number of wins of MML estimates is always in the majority (Figure 12d). 

The observations are in contrast to what was observed in the case of ft = 1 where MML 
estimates emerged as consistently better estimates. In terms of statistical hypothesis 
testing, the null hypotheses corresponding to modelling using moment, ML, MAP or 
MML estimates are accepted at the 1% significance level. 

• ft = 100: The comparison results in this case follow the same pattern as that of ft = 10 
(not illustrated here as they are similar to Figure 12). 

When ft = 10 and 100, the MAPI estimates perform competitively compared to the MML 
estimates (with respect to bias and MSE). Further, the proportion of times MAPI estimates 
win with respect to KL divergence progressively increases as the eccentricity increases. 
In general, similar results are observed for k > 10. However, as discussed previously, 
MAP-based estimation is subjective to the parameterization of the distribution as shown 
by the stark contrast between MAPI and MAP2 estimates by the two paranreterizations 
even though they are both reasonable. The moment, ML, and MML estimates, on the 
other hand, are not affected by parameterization. Amongst these, MML-based estimates 
outperform with respect to all objective metrics as described here. 

9.2.2 Varying sample size N, fixed concentration k and eccentricity e 

We have also explored the behaviour of different estimates with increasing sample size from 
N = 10 to N = 50. For space reasons, we only include the results for k = 10. The results 
are discussed for three specific eccentricity values, ranging from low eccentricity (e = 0.1), 
to moderate eccentric (e = 0.5) to high (e = 0.9). 
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Figure 11: N = 10, k = 1. 


• e = 0.1: The comparison results are presented in Figure 13 which clearly shows how, 
across all estimators, the bias and MSE decrease as N increases. This is expected: as 
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(f) Variation of p-values 


Figure 12: IV = 10, k = 10. 


more data becomes available, the accuracy of estimation increases. Figure 13(a) and 
(b) illustrate that the bias and MSE are prominent for moment and ML estimators. 
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The bias of MML estimates is close to zero and convincingly lower than both versions 
of MAP estimates, especially when N < 25. The MSE of MML estimates is smaller 
but close to that of MAPI estimate. 

The proportion of wins of MML estimates with respect to KL-divergence is the high¬ 
est with values of at least 70% and 80% when compared with MAPI and MAP2, 
respectively (see Figure 13c,d). Also, hypothesis testing indicate that the respective 
estimates constituting the null hypothesis are accepted at the 1 % significance level, 
as observed from the boxplots of test statistics and p-values in Figure 13(e) and (f). 

• e = 0.5: The comparison results are presented in Figure 14. Similar to the previous 
case, the bias and MSE of moment and ML estimates are considerable high compared 
to those of the MAP and MML estimates. Also, MAPI estimates have greater bias 
and MSE as compared to MAP2 estimates. In this case, the bias and MSE of MAP2 
and MML are close to zero. 

The proportion of wins of MML estimates with respect to KL divergence is higher with 
about 40% and 50% when compared against MAPI and MAP2 estimates, respectively. 
The proportion of wins are, however, lower compared to the previous case when e = 0.1 
as shown in Figure 14(c) and (d). 

• e = 0.9: The comparison results are presented in Figure 15. In this case, again, the 
bias and MSE of moment and ML estimates are greater compared to others. For 
N < 20, the bias of MML estimates is greater when compared to those of MAPI and 
MAP2 (Figure 15a). Further, the MSE of MML estimates is greater than that of 
MAPI and lower than that of MAP2. As the MSE combines the bias and variance 
terms, there is a tradeoff that leads to this result. When IV > 25, there is almost no 
difference in the bias and MSE due to MAP and MML estimates. 

Also, the proportion of wins of KL divergence for MAPI is greater than all others 
(Figure 15c). This corresponds to the proportion of wins as illustrated through Fig¬ 
ure 12(c), similar to the N = 10, k = 10, e = 0.9 case. However, when compared with 
MAP2, the MAIL estimates have greater proportion of wins ~ 40% (Figure 15d). 

The traditional ML estimators are known to have considerable bias, especially at lower 
sample sizes (Dryden and Mardia, 1998; Dore et ah, forthcoming). The ML estimates of k in 
the case of a vMF distribution are known to be biased (Schou, 1978; Best and Fisher, 1981; 
Cordeiro and Vasconcellos, 1999). Similarly, the ML estimates of a Bingham distribution, 
which is a special case of a FB 5 distribution, are also shown to be biased and corrections have 
been proposed (Cordeiro and Klein, 1994; Kume and Wood, 2007; Dore et ah, forthcoming). 

The MML-based estimates have been shown to be effective in reducing bias in the case 
of a vMF distribution (Kasarapu and Allison, 2015). For an FB 5 distribution, we empiri¬ 
cally demonstrated that, in comparison to the moment and ML estimates, the MML-based 
estimates have lower bias and MSE. Further, when compared to MAP estimates, MML 
estimates are competitive, particularly considering that MAP estimates are dependent on 
the parameterization. As a result, MAP estimates are inconsistent and should therefore be 
avoided. In contrast, MML-based estimates are invariant to alternative parameterizations 


35 



Test statistic % of wins Bias squared 


Kasarapu 





(c) KL divergence (MAP version 1) 


(d) KL divergence (MAP version 2) 


6 


5 


MOMENT : 
MAPI c 
MML c 
MAP2 c 


4 


3 



-L 

35 


Sample size 



MAPI i-1 

MML i - 1 

MAP2 r^~i 

0.3 L 

10 15 20 25 30 35 40 45 50 

Sample size 


(e) Variation of test statistics 


(f) Variation of p-values 


Figure 13: k = 10, eccentricity = 0.1 


(Oliver and Baxter, 1994; Wallace, 2005). In this regard, we discuss another parameteriza- 
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Figure 14: k = 10, eccentricity = 0.5 


tion in Appendix A involving all parameters of an FB 5 distribution to further strengthen 
our case. 
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Figure 15: k = 10, eccentricity = 0.9 

10. Experiments involving FB 5 mixtures 

To demonstrate the applicability of FB 5 mixtures, we consider the problem of mixture 
modelling of directional data arising out of protein three-dimensional conformations. A 
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protein chain consists of a sequence of amino acids (residues). Each residue has a central 
carbon atom C a . If C l a and CT +1 denote the carbon atoms at positions i and i + 1 in the 
protein chain, then the distance between these successive atoms is highly constrained to 
be 3.8A because of the chemical interactions between the constituent atoms. Thus, C 
atom lies on a sphere of radius ~ 3.8A whose centre is C l a . The direction vector from C l a 
to C l a +1 is considered a point in the data set. Given the Cartesian coordinates of a C a 
atom, its co-latitude ( 9 ) and longitude (4>) are determined with respect to the previous C a 
atom in a consistent manner (Kasarapu and Allison, 2015). The set of all {0,<f>) pairs form 
the directional data corresponding to a given set of protein structures. The protein data 
set considered is the publicly available ASTRAL SCOP-40 (version 1.75) database (Murzin 
et ah, 1995). Out of the entire dataset, the u /3 class” proteins comprising of 1802 structures 
is a case in point. The empirical distribution consists of 251,346 {0,<f>) pairs and we infer 
mixtures on this directional data using the search method described in Section 8.4. 

10.1 Evolution of vMF and FB 5 mixtures 

Mixtures of vMF distributions were previously explored by Kasarapu and Allison (2015). 
This entailed estimating the vMF concentration parameter k using MML. They use Taylor 
series approximations (Newton’s and Halley’s root-finding methods) in the computation 
of the MML estimates of k. Both these root-finding approaches are truncated after two 
iterations in order to do a fair comparison with the other contemporary k approximations 
that were discussed in that work (see Equations 8 and 9 in Kasarapu and Allison (2015)). 
The obtained vMF k estimates were used as part of the mixture modelling apparatus. As a 
result, the search method employed for determining the optimal number of vMF components 
was terminated prematurely. Furthermore, the search heuristic employed in Kasarapu and 
Allison (2015) does a random selection of initial means of the child components while 
splitting a parent component. In contrast to these, the vMF k estimates used in the current 
work correspond to the converged values (without truncating prematurely). Further, as per 
the search method described in this work, while splitting a parent component, the initial 
means of the children are chosen such that they are reasonably apart which gives them 
the best chance to form two distinct sub-components in order to escape a local optimum 
(explained in Section 8.4.1). 

The search method infers a 37-component vMF mixture and terminates after 49 itera¬ 
tions involving split, delete, and merge operations. When modelled using FB 5 distributions, 
the search method infers 23 components and terminates after 33 iterations. In each of these 
iterations, for every intermediate /L-component mixture, each constituent component is 
split, deleted, and merged (with an appropriate component) to generate improved mix¬ 
tures. The method terminates when these perturbations do not result in an improvement. 

In the case of vMF mixture, the search method begins with a one-component mixture, 
continuously favours splits over delete and merge operations until a 17-component mixture 
is inferred. This corresponds to the steady increase in the first part of the message length 
as observed by the red curve in Figure 16(a) until the 17 th iteration. Thereafter, a series 
of deletions and splits result in an intermediate sub-optimal 19-component mixture at the 
end of the 23 rd iteration. This is characterized by the step-like behaviour of the red curve 
between the 17 th and 24 th iteration. The first part of the message is dependent on the 
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number of components (model complexity) and an increase in number of mixture compo¬ 
nents leads to an increase in the encoding cost of the parameters. From the 24 th iteration, 
the method continues to split the constituent components until a 36-component mixture 
is inferred after 40 iterations. This is reflected in the continuous rise of the red curve in 
Figure 16(a) between 25 th and 40 th iterations. Thereafter, through a series of perturbations, 
the final resultant mixture has 37 components at the end of 49 iterations, characterized by 
a step-like behaviour towards the end between 40 th and 49 th iterations. 

In the case of FB 5 mixture, the search method infers a 23-component mixture at the end 
of 23 iterations by continuous splitting. This corresponds to the steady increase in the first 
part of the message length denoted by the red curve in Figure 16(b). From here on, after 
a series of perturbations, the final mixture stabilizes at the end of 33 rd iteration thereby 
resulting in a 23-component mixture. This is characterized by the step-like behaviour 
corresponding to intermediate reduction and increase in the number of mixture components 
between the 24 th and 33 rd iterations. 



Iterations 
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(a) vMF mixture 


(b) FB 5 mixture 


Figure 16: Evolution of mixtures inferred by the search method. Note there are two Y-axes 
in both (a) and (b) with different scales: the first part of the message follows the 
right side Y-axis (red); while the second part and total message lengths follow 
the left side Y-axis (black). 


In both cases, the second part of the message length continues to decrease with an 
increase in the number of mixture components. An initial sharp decrease is observed in both 
mixture types. The search method terminates when the increase in first part dominates the 
reduction in the second part leading to an increase in total message length. 

10.2 Comparison of vMF and FB 5 mixture models 

The resulting vMF and FB 5 mixtures are shown in Figure 17. In order for effective visual¬ 
ization of the individual mixture components, the illustration includes the contours of the 
components such that they encompass 80% of the probability corresponding to each compo¬ 
nent. The data plotted is a random sample of 10000 pairs drawn from the empirical 
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distribution of (3 class of proteins. The regions in Figure 17 are coloured based on the empir¬ 
ical distribution (heat map). There are two distinguishable regions of the distribution of 0 
and 0 values. At ( 9 , < j>) ~ (90°, 60°), there is a concentrated mass which corresponds to the 
helical region in a typical protein. The area characterized by 6 E (40°, 80°), (/> E (180°, 240°) 
roughly corresponds to the strand region in a typical protein. 

The search method inferred a 37-component vMF mixture and a 23-conrponent FB 5 mix¬ 
ture. It is observed that the number of components used to model the entire collection of 
251,346 (9, c/>)-pairs using a FB 5 mixture model is fewer compared to a vMF mixture. This 
is expected as a vMF distribution is a specific case of a FB 5 distribution and, hence, a 
vMF mixture requires more number of components to model data that is asymmetrically 
distributed. In Figure 17(a), the vMF mixture components 1-11 are used to model the 
helical region (approximately), whereas in Figure 17(b), the same region is modelled us¬ 
ing FB .5 mixture components 1-6. Similarly, the strand region in the proteins is modelled 
by components 12-17 in the vMF case, whereas, it is modelled by components 1-11 using 
FB 5 mixture. Further, components 18-24 in the vMF mixture and components 12, 13 in the 
FB 5 mixture model the same region. The other regions in the protein directional data space 
follow the same modelling pattern, that is, with fewer FB 5 components. These observations 
reflect the better explanatory power of FB 5 mixtures compared to vMF mixtures. 

Compared to a singleton vMF distribution, the encoding cost of the parameters of a 
FB 5 distribution would be greater as it is a complex model with more number of parameters. 
As shown in Table 1, the encoding cost of the parameters of the inferred 23-component 
FB 5 mixture is 1095 bits. A vMF mixture with the same number of components has a 
first part equal to 749 bits (a difference of 346 bits). However, the second part of the 
message (the fit to the data) is lower for the FB 5 mixture (a difference of ~ 17,000 bits). 
Hence, the gain in the second part outweighs the greater cost of encoding the more complex 
FB 5 mixture. Thus, the total message length is lower for the FB 5 mixture and serves as 
a better model to explain the data. If the 23-component vMF mixture is compared with 
the 37-component vMF mixture inferred by our search method, the vMF mixture with 23 
components has smaller first part and greater second part (Table 1). The 37-component 
mixture has a first part equal to 1177 bits compared to 749 bits in the 23-conrponent case 
(a difference of 428 bits). However, there is a gain of 11,000 bits in the second part, thus, 
resulting in a lower total message length in the 37-conrponent case. Through this analysis, 
it is shown how the tradeoff of choosing a complex model and the quality of fit is addressed 
using the MML framework. 


Mixture 

model 

Number of 
components 

First part 
(thousands of bits) 

Second part 

Total message length 

(millions of bits) 

vMF 

23 

0.749 

5.490 

5.491 

vMF 

37 

1.177 

5.479 

5.481 

fb 5 

23 

1.095 

5.473 

5.474 


Table 1: Message lengths corresponding to mixtures inferred on the protein directional data. 
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Figure 17: Mixtures inferred on the /3-class proteins (9 and 4> are hi degrees). 


It is also interesting to note the shape of the contours generated by both vMF and 
FB 5 mixtures. A vMF distribution caters to symmetrically distributed data and has circular 
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contours of constant probability on a spherical surface. Hence, in the Ocj) space, we see 
regular oval-shaped contours as shown in Figure 17(a). In contrast, a FB 5 distribution has 
ellipse-like contours on a spherical surface (Figure 2). Thus, when projected onto the 9<p 
space, it results in a myriad of contour shapes (Figure 17b) depending on the parameters 
defining a FB 5 distribution. 

Compressibility of protein structures: The better explanatory power of FB 5 mixtures over 
vMF mixtures leads to enhanced data compression (demonstrated through Table 1), and 
hence, serve as efficient descriptors to model directional data. In the context of proteins, 
previous null model descriptors are based on the uniform distribution on the sphere (Kon- 
agurthu et ah, 2012), and due to vMF mixtures (Kasarapu and Allison, 2015). The null 
model descriptions provide a baseline for encoding protein coordinate data in varied struc¬ 
ture modelling tasks (Konagurthu et al., 2012, 2013; Collier et ah, 2014). In this regard, 
the use of FB 5 mixture offers a better alternative as opposed to encoding using uniform 
distribution or vMF mixtures. 

The message length expressions to encode the orientation angles using uniform, vMF, 
and FB 5 null models are given by Equation 31 (Konagurthu et ah, 2012; Kasarapu and 
Allison, 2015), where x corresponds to a unit vector described by (8,(j>) on the surface of 
the sphere, e is the precision 5 to which each coordinate is measured, and r denotes the 
distance between successive C a atoms. In Equation 31, for the vMF mixture, K = 37 and 
the null model corresponding to the FB 5 mixture has K = 23 components. 


Uniform Null 


vMF & FB 5 Null 


lo g 2 ( 4 ^ 2 ) = lo S 2 ( 47 r) - 21 og 2 0 


47T7-2 

( K 

lo S2 

\i =1 


- 21 og 2 (- 


bits. 


bits. 


(31) 


The inferred mixture models are then used to encode the entire protein data. After ac¬ 
counting for the distances between the successive atoms, the total message lengths obtained 
are given in Table 2. The uniform distribution is clearly not an appropriate descriptor and 
this can be reasoned from the empirical distribution which has multiple modes (Figure 17). 
The inferred vMF mixture has better explanatory power over the uniform distribution as 
it has a corresponding saving of 446,000 bits over 251,346 data points (residues). This 
translates to an enhanced compression of 1.778 bits per residue (on average). The inferred 
FB 5 mixture, however, encodes the same amount of data with a saving of 7,000 bits against 
the vMF mixture (an average of 0.026 bits extra compression per residue). The results fol¬ 
lowing the application of FB 5 mixtures to modelling protein directional data demonstrate 
that that they supersede the vMF mixture models (Table 2). The ability of FB 5 distribu¬ 
tions to model asymmetrical data leads to improved encoding of the protein data. Hence, 
they serve as natural successors to the vMF null model descriptors. 


10.3 Comparison of MML criterion with other information-theoretic criteria 

The MML criterion is used in computing the score associated with a mixture model by 
separately encoding the parameters (first part) and the data given those parameters (second 

5. Protein coordinate data is measured to an accuracy of e = O.OOlA. 
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Null model 

Total message length 

Bits per 

(millions of bits) 

residue 

Uniform 

6.895 

27.434 

vMF mixture 

6.449 

25.656 

FB 5 mixture 

6.442 

25.630 


Table 2: Comparison of the null model encoding lengths based on uniform distribution, 
vMF mixture (37 components), and FB 5 mixture (23 components). 


part). This yields the total message length (Equation 29) which is used to find improved 
mixtures during the search process. In addition to the MML criterion, as discussed in 
Section 8.3, the traditional information-theoretic criteria used are AIC (Akaike, 1974) and 
BIC (Schwarz, 1978; Rissanen, 1978). These two criteria introduce constant term penalties 
depending on the number of free parameters in the mixture model. If p denotes the number 
of a model’s free parameters 6 , £(D|4>) is the minimized negative log-likelihood of data given 
the parameters <1>, and N the sample size, then AIC and BIC are given by 

AIC(p) =p + C{V\$) and BIC(p) = | log N + C(V\$) 

Mixture modelling of some observed data based on these criteria can be done as follows: 

• Exhaustive search: The search heuristic (Section 8.4) to determine the optimal mixture 
can be used alongside any objective function and not necessarily the MML criterion. 
The series of perturbations are carried out as described and the improvement to 
mixtures is determined based on the criterion in use. 

• Traditional search: As discussed in Section 8.3, the traditional search method using 
AIC/BIC involves estimating the mixture parameters using the EM algorithm (Sec¬ 
tion 8.1) for varying number of components K and choosing the one which results in 
minimum criterion value. 

It is to be noted that with the MML criterion, the EM algorithm in Section 8.2 is used to 
obtain the MML estimates of the mixture parameters. However, with AIC and BIC, the 
EM algorithm in Section 8.1 results in the maximum likelihood (ML) estimates for a given 
K. For FB 5 mixtures, the ML estimates are often approximated by the moment estimates 
which are used in the M-step of the EM algorithm (Peel et ah, 2001; Kent and Hamelryck, 
2005; Hamelryck et al., 2006). We compare the results for mixtures obtained using the ML 
estimates and their approximations against those obtained using MML-based estimates. 

The results pertaining to the exhaustive search method are shown in Table 3. It is 
observed that when search is based on AIC, the mixtures resulting due to moment and 
ML estimation have 37 and 34 components respectively. The moment and the ML mixtures 
have the same AIC values in this case. With BIC, the mixture resulting from ML estimation 
has the lower BIC value. In this case, the moment and the ML mixtures have 23 and 24 

6 . The number of free parameters in a FB 5 mixture with K components is p = 5 K + (K — 1) = 6 K — 1. 
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components respectively. This number resembles the one obtained by the exhaustive search 
but MML-based parameter estimation. The ML mixture has the lowest BIC score. 



Moment mixtures 

Maximum likelihood mixtures 

Criterion 


Criterion 

Message 


Criterion 

Message 

K 

value 

length 

K 

value 

length 



(xlO 5 bits) 

(xlO 6 bits) 


(xlO 5 bits) 

(xlO 6 bits) 

AIC 

37 

2.313 

5.474 

34 

2.313 

5.474 

BIC 

23 

4.647 

5.475 

24 

4.645 

5.474 


Table 3: FB 5 mixtures inferred by employing the exhaustive search method and changing 
the evaluation criteria and methods to estimate mixture parameters. 


The results pertaining to the traditional search method are shown in Figure 18. As the 
number of components K is increased, it is expected that the AIC and BIC scores decrease 
until some minimum is reached and then increase thereafter. The value of K at which 
this behaviour happens is treated to be the optimal mixture that models the data. It is 
observed that initially, both criteria decrease and after K = 30, the values do not change 
dramatically. By increasing K , the linear increase in penalty factors and the associated 
increase in log-likelihood are of the same magnitude, and hence, the difference in criteria is 
not apparent. Thus, using the traditional search, it is difficult to decide on an appropriate 
number of mixture components. 

The trend observed in Figure 18 is the same for mixtures obtained using both moment 
and ML estimates. The expressions for AIC and BIC do not help in distinguishing the 
moment and ML mixtures because for different types of estimates and a given K, the 
penalty terms are the same. Also, the log-likelihood is approximately the same because for 
huge amounts of data, as is the case here, all the estimates converge to the same value. 

In contrast, if we compute the first part message lengths corresponding to the moment 
and ML mixtures, for a given K , the differences in their encoding lengths become apparent. 
The variation in the first part message lengths for the moment and ML mixtures resulting 
from the traditional search are shown in Figure 20. It is observed that until K = 30, the first 
part message lengths of moment and ML mixtures are close to each other. In Figure 20(b), 
when K > 30, there are minute differences between encoding lengths of mixture parameters 
obtained using moment and ML estimates. Thus, unlike AIC/BIC, the MML criterion is 
able to distinguish mixtures with equal number of components. The first part corresponds 
to the model complexity and is dependent on not just the number of components K but 
also on the components’ parameters themselves according to the MML framework. 

The above discussion is aimed at projecting the limitations of the traditional search 
method and also the use of AIC and BIC as evaluation criteria. We find in MML an 
objective way to assess mixtures and in conjunction with the search method offers a better 
alternative to determine reliable mixture models. We further illustrate the behaviour of the 
search method with smaller amount of data. The previous discussion pertains to the entire 
empirical data set containing N = 251,346 (0,(f>) pairs. In the current context, from the 
empirical distribution, we randomly sample varying amounts of data ranging from N = 1000 
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Figure 18: The criteria computed for max¬ 
imum likelihood mixtures (mo¬ 
ment mixtures have the same 
behaviour and are hence not 
shown) 



Figure 19: Variation of the number of in¬ 
ferred components using the 
search method based on exhaus¬ 
tive perturbations. 




(a) (b) 

Figure 20: First part message length corresponding to mixtures evaluated using AIC. The 
results for BIC display the same pattern and are hence not shown, (the range 
of I\ E [1, 50] is split into two sub-figures (a) and (b) in order to highlight the 
differences in the message lengths). 


to N = 20,000. The experiment is conducted by fixing the search method (exhaustive) 
but changing the evaluation criteria to infer suitable FB 5 mixtures. It is observed that 
mixtures based on AIC have greater number of components as compared to BIC and MML 
(see Figure 19). The mixtures corresponding to BIC and MML have the same number of 
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components in most of the experimental trials. These results are in agreement with what 
was observed on the complete protein data (Table 3), where AIC resulted in greater number 
of components. 

11. Conclusion 

We derived the parameter estimates of a FB 5 distribution defined on a three-dimensional 
unit sphere based on the Bayesian information-theoretic minimum message length criterion. 
The derived estimators have lower bias and mean squared error compared to the tradition¬ 
ally used moment and maximum likelihood estimators. The MML-based estimates are also 
invariant to transformations of the parameter space unlike the MAP estimates. Hence, the 
MML-based estimates are improvements over the traditionally used estimators. Further, 
we have designed the mixture modelling apparatus to be used in conjunction with FB 5 mix¬ 
tures and demonstrated their applicability in modelling real-world directional data resulting 
from protein spatial orientations. The results obtained from modelling using FB 5 mixtures 
is contrasted with commonly used vMF mixtures. The FB 5 mixture models supersede the 
vMF models in describing protein data, and serve as improved null model descriptors that 
are important to modelling tasks in structural biology. 
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Appendix A. Prior density of © governed by the n prior for the 2D vMF 


In the MML estimation of parameters of a vMF distribution on a circle, Wallace and 

We discuss this prior additionally as this leads to 


Dowe (1994) use h(n) = —- - , . 

(1 + k z ) 6 ' z 

an invertible transformation of all five parameters (described below) in the context of a 


FB 5 distribution. As in Section 5.1, hA(ip,a,i]) = 


sm a 
47 T 2 


and h((3\n) = 2 /k. Hence, the joint 


prior density h® is formulated as shown below. Further, using the eccentricity transform 
described in Section 5.2.1, the joint prior density h®/ in & parameterization is given below. 


h®(ip, a, 17, k , (3) = 


sma 


27 t 2 (1 + k 2 ) 3 / 2 


and h®> (ip, a,r), k, e) = 


k sm a 


47 T 2 (1 + K 2 ) 3 / 2 


A.l An alternative parameterization of the parameter vector © 

In addition to the eccentricity transform, we study a transformation proposed by Rosen¬ 
blatt (1952), that transforms a given continuous k -variate probability distribution into the 
uniform distribution on the £:-dinrensional hypercube. Such a transformation applied on the 
prior density of the FB 5 parameter vector © results in the prior transforming to a uniform 
distribution. Hence, estimation in this transformed parameter space is equivalent to the cor¬ 
responding maximum likelihood estimation. For the 5-parameter vector © = [ip, a, rj, k, f3}, 
the Rosenblatt (1952) transformation to 0" = {z±, z 2 , z 3 , 24 , 25 } is given by 

21 = Pr(Xi <if)= F\(ip) 

z 2 = Pr(X 2 < a\Xi = ip) = F 2 (a\ip) 

z 3 = Pr(X 3 < r/\X 2 = a, X 3 = ip) = F 3 (r]\a,ip ) 

24 = Pr(X 4 < n\X 3 = p, X 2 = a,X\ = ip) = F±(K\q, a, ip) 

25 = Pr(X 5 < /3\X 4 = K ,X 3 = g, X 2 = a,X 1 = ip) = F 5 (f3\n, rj, a, ip) 
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This transformation results in 0 < z* < 1. i = Further, Rosenblatt (1952) argues 

that each z, L is uniformly and independently distributed on [ 0 , 1], so that the prior density 
in this transformed parameter space is /i©»(zi, Z2, z 3 , Z4, Z5) = 1. In order to achieve such 
a transformation, we need to express z % in terms of the original parameters. As per the 
definitions of the prior on 0 (Section 5.1), the following relationships are derived: 

Z\ = 'l/j/lT =£- if) = 7 TZi 

Z2 = (1 — cosa )/2 a = arccos(l — 2Z2) 

z 3 = 77 /( 271 -) => ?? = 2ttz 3 (32) 


Based on the independence assumption in the formulation of priors of angular and scale 
parameters (Section 5.1), Z 4 = 7 * 4 ( 077 , a, i/r) = F^n). Similarly, F^(P\k, 77 , a, ij>) = Fs(/3|k). 
Hence, the invertible transformations corresponding to k and f3 are as follows: 


Z4 = I hjujdn = 


'0 (1 + ft 2 ) 3 / 2 


(Ik = 1 — cos(arctan k) 


£5 = F 5 (/3\k) = 2 P/k =r (3 = KZ 5/2 


k = tan(arccos(l — Z4)) 

(33) 


With the 3D version of vMF k prior (see Section 5.1), Z 4 evaluates to — arctan k -^ 

7 r \ 1 + k z 

This version of Z 4 is not invertible as it does not allow us to express k as a closed form ex¬ 
pression in Z4. Hence, the Rosenblatt (1952) transformation is discussed only in the context 
when 2D vMF k prior is considered, as it is possible to find an inverse transformation. 


A.2 The example demonstrating the effects of alternative parameterizations 

The above discussed prior and its variants are used in the MAP-based parameter estimation 
of the data from the example discussed in Section 5.3. The resulting estimates of -0, cr, 77 
are given below using: 

h&:^= 2.070, a = 1.493, rj = 1.522 
h&, : %jj = 2.070, a = 1.493, rj = 1.522 
Zi©// : z) = 0.659, % = 0.461, z 3 = 0.242 

As observed, ijj, a, and rj are the same when posteriors corresponding to /i© and h©/ are 
used. In the case of /?,©//, the mapping of zi,Z 2 ,z 3 back to -0, cr, 77 (Equation 32), results in 
the same estimates as that of /i© and /;.©/. Hence, the MAP estimates of a, 77 are the 
same across the different versions. The estimates of k and f3 are, however, not the same 
under the various transformations. They are as follows depending on the parameterization: 

/?.© : k = 16.975, = 5.467 

/?.©/ : k = 20.547, e = 0.701 (3 = K,e/2 = 7.205 

h@" : Z 4 = 0.964, % = 0.779 k = 28.065, /3 = 10.925 (as per Equation 33) 

The estimated value of k using /?.© is 16.975 whereas it is 20.547 using /i©/. The value of e 
corresponds to a (3 = 7.205. Similarly, the value of k and (3 corresponding to Z 4 = 0.964 and 
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£5 = 0.0.779 are 28.065 and 10.925 respectively. Clearly, the value of the parameter esti¬ 
mates depend on the parameterization. However, it is required that the estimates obtained 
in different parameterizations should be the same irrespective of the space in which the 
parameters are defined. However, through this example, it is observed that for the various 
parameterizations, the value of MAP estimates differ. 

The variation of the posterior density under various transformations of the parameter 
space are shown in Figure 21. These are plotted as a function of k, (5 (in case of /i@), k, e 
(in case of h®/), and Z 4 , Z 5 (in case of /i®»), each reflecting the space in which the posterior 
is defined. It is observed that the modes of the respective posterior distributions occur at 
different positions and they are not equivalent to each other. The posterior density plots 
in Figure 21(b) and (c) correspond to those in Figure 21(d) and (e) respectively. Ideally, 
(the modes in) Figure 21(a)-(c) should be the same. However, as demonstrated, that is 
not the case. Thus, maximizing the posterior density does not yield consistent estimates as 
observed through this example. 



(d) h & (e) h & n 


Figure 21: Heat maps depicting the modes (MAP estimates) of the posterior density re¬ 
sulting from different parameterizations. 


Appendix B. The partial derivatives of 71,72,73 with respect to ^,0,7 

The first and second order partial derivatives of the axes are required for the evaluation 
of the elements of the Fisher information matrix (see Section 6.2.1). The expressions for 
71 , 72,73 as a function of ip,a,r] are given by Equation 5. 
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Derivatives of 71: 


dji 

da 


9 2 7 i 
da 2 


= - 7 i 


9 2 71 



9tl 

9?? 


- sm a cos 77 
\ — sin a sin 77, 


( 


0 


sm a sm 77 
y sin a cos 77 

/ 


9 2 7 i 

dr/da 


0 


- cos a sm 77 
\ cos a cos 7/ 


The partial derivatives of 71 involving the parameter ip are zero vectors. 
Derivatives of 72: 


c*72 , <972 

_ = _ cosV , 7 i, _ 


<9 2 72 , $7i 9 2 72 

^-- COS ^’ 9^" 


/ 0 

— cos ip cos a sin 77 — sin 1/7 cos 7/ 
\ cos 1/7 cos a cos 7/ — sin 1/7 sin 77 

( 0 

— cos ip cos a cos 77 + sin ip sin 77 

cos 7/7 cos a sin 77 — sin -0 cos 77y 


<972 

90 

< 9 2 72 

dip 2 


= 73 


= -72 


<9 2 72 , <97i 

= -COS0—— , 


dr/da 
Derivatives of 73: 

973 . , ^73 


9 2 73 . , 97i 9 2 7 3 

9 a 2 “ Sm ^ 9 a ’ 9; ? 2 


9 2 7 2 

909 a 


= sin 07i, 


9 2 7 2 

dip drj 


dr] 

( 0 

sin 0 cos a sin 7/ — cos ip cos 77 

0— sin 0 cos a cos 77 — cos ip sin 17 j 

( 0 

= sin 0 cos a cos 17 + cos ip sin 77 

ysin 0 cos a sin 77 — cos 0 cos 77 y 


<973 

9?7 


973 


9 2 7 3 _ . , 97i 9 2 7 3 

9779a Sm drj ’ dipda 


= cos07i 


< 9 2 73 
dip dr] 


dip 

< 9 2 73 

dip 2 

972 

dr] 


= -72 


= -73 


Appendix C. The search process continued 

The illustrations presented here are continuation of the example discussed in Section 8 . 5 . 1 . 
Figure 22 illustrates the perturbations carried out on the component P2 in the mixture AI3. 
None of the split, delete, and merge operations involving P-2 result in improved mixtures. 
The same is the case with component P3 (depicted in Figure 23 ). It is interesting to note 
the different mixtures obtained by splitting P2 and P3 in Figure 22 (c) and Figure 23 (c) 
respectively. These 4 -component mixtures are different to the mixture obtained by splitting 
Pi (Figure 9 c). Also, the mixtures resulting from deleting and merging of P2 and P3 are 
different when compared to the mixtures obtained by the same operations on Pi. This 
example demonstrates how the search method evaluates various competing mixtures and 
selects the one which has the least overall message length. 
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(g) 


(h) Before optimizing (7 = 19244) (i) post-EM (7 = 19236) 


Figure 22: Iteration 3 - perturbations of P 2 (a)-(c) splitting, (d)-(f) deletion, (g)-(i) merging 
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(a) Initialize means of children (b) Optimized children (7 = 19174) 


(c) post-EM (7 = 19168) 



(d) 


(e) Before optimizing (7 = 20743) 


(f) post-EM (7 = 19237) 



Figure 23: Iteration 3 - perturbations of P 3 (a)-(c) splitting, (d)-(f) deletion, (g)-(i) merging 
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